Feature request: Custom NER model with spacy #1453
Comments
I wouldn't really recommend using Rasa NLU to extract entities from documents/PDFs; it is meant for extracting entities from shorter sentences, mainly for chatbots.
@akelad thanks for your note. If you are aware of any starting point for this (a Python script or another NLU library), please share.
I'm not really sure; the spaCy documentation is probably a good starting point. I'll close this issue for now.
@akelad Even for short sentences, the CRF-based "ner_crf" is not giving accurate results. I have trained a model to extract source and destination from short sentences. Please find the training data in the attachment below. My config file is below. language: "en" pipeline:
Rasa NLU version: 0.13.4
Currently, the only option for creating a custom NER model is CRF. I have multiple use cases where I need to extract data from documents/PDFs, and I haven't seen good results so far using CRF. I would like to try a custom spaCy model.
For now, I am creating the spaCy model with the Python script provided by spaCy, but it would be good to have an out-of-the-box option in Rasa NLU for creating a custom spaCy model.
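One possible bridge between the two tools, until something like this exists out of the box, is to reuse existing Rasa NLU training data when training a spaCy NER model separately. The sketch below assumes the Rasa NLU JSON training format (`rasa_nlu_data` / `common_examples`, with `start`, `end`, `value`, `entity` per annotation) and produces `(text, {"entities": [(start, end, label)]})` tuples, the shape used by spaCy 2.x training examples. The function name `rasa_to_spacy` and the sample sentence are hypothetical and not part of either library.

```python
def rasa_to_spacy(rasa_json):
    """Convert Rasa NLU JSON training data into spaCy-style NER tuples.

    Hypothetical helper for illustration; not part of Rasa NLU or spaCy.
    """
    examples = rasa_json["rasa_nlu_data"]["common_examples"]
    converted = []
    for example in examples:
        text = example["text"]
        spans = []
        for ent in example.get("entities", []):
            start, end = ent["start"], ent["end"]
            # Skip annotations whose offsets fall outside the text
            if not (0 <= start < end <= len(text)):
                continue
            spans.append((start, end, ent["entity"]))
        converted.append((text, {"entities": spans}))
    return converted


# Made-up sample data in the Rasa NLU JSON layout (an assumption for
# illustration, not the attachment mentioned in this thread).
SAMPLE = {
    "rasa_nlu_data": {
        "common_examples": [
            {
                "text": "book a cab from Pune to Mumbai",
                "intent": "book_cab",
                "entities": [
                    {"start": 16, "end": 20, "value": "Pune", "entity": "source"},
                    {"start": 24, "end": 30, "value": "Mumbai", "entity": "destination"},
                ],
            }
        ]
    }
}

if __name__ == "__main__":
    for text, annotations in rasa_to_spacy(SAMPLE):
        print(text, annotations)
```

The resulting list can then be fed to whichever spaCy training script is in use; the conversion only checks that offsets lie within the text and does not check that `value` matches the text, since the two can differ when synonyms are annotated.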