Feature request: Custom NER model with spacy #1453

Closed
kapilkathuria opened this issue Oct 9, 2018 · 4 comments

@kapilkathuria

Rasa NLU version: 0.13.4

Currently, the only option for creating a custom NER model is CRF. I have multiple use cases in which I need to extract data from documents/PDFs. I haven't seen good results so far using CRF, so I wanted to try a custom spaCy model.

For now, I am creating the spaCy model with the Python script provided by spaCy, but I think it would be good to have an out-of-the-box option in Rasa NLU to create a custom spaCy model.

@akelad
Contributor

akelad commented Oct 9, 2018

I wouldn't really recommend using Rasa NLU to extract entities from documents/PDFs; it's for extracting entities from shorter sentences, mainly for chatbots.
I think using your own Python script for this is the correct approach in this case.

@kapilkathuria
Author

@akelad thanks for your note. If you are aware of any starting point for this (a Python script or another NLU library), please share it.

@akelad
Contributor

akelad commented Oct 10, 2018

I'm not really sure; the spaCy documentation may be a good starting point. I'll close this issue for now.

akelad closed this as completed Oct 10, 2018

@prashant334

prashant334 commented May 14, 2019

> I wouldn't really recommend using Rasa NLU to extract entities from documents/PDFs; it's for extracting entities from shorter sentences, mainly for chatbots.

@akelad Even for short sentences, the CRF-based "ner_crf" is not giving accurate results. I have trained a model to extract the source and destination from short sentences like
"show me the trains from Delhi to pune", where Delhi = SOURCE and Pune = DESTINATION.
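
For readers who want to try the spaCy route with a sentence like this one, here is a minimal sketch of how it could be turned into the character-offset annotation format that spaCy's NER training examples commonly use, i.e. a `(text, {"entities": [(start, end, label), ...]})` tuple. The helper name `to_offset_example` and the way the labels are passed in are assumptions made for illustration; they are not part of Rasa NLU or spaCy, and nothing here reflects the contents of the attached data file.

```python
from typing import Dict, List, Tuple

# (text, {"entities": [(start_char, end_char, label), ...]})
Annotation = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def to_offset_example(text: str, entities: Dict[str, str]) -> Annotation:
    """Hypothetical helper: map entity surface strings to character offsets.

    `entities` maps the entity text as it appears in the sentence to its label.
    Matching is case-insensitive and only the first occurrence is used, so
    sentences that repeat an entity need to be annotated by hand.
    """
    lowered = text.lower()  # assumes lowercasing keeps the length (true for ASCII)
    spans = []
    for surface, label in entities.items():
        start = lowered.find(surface.lower())
        if start == -1:
            raise ValueError(f"{surface!r} not found in {text!r}")
        spans.append((start, start + len(surface), label))
    # Sort by start offset so the spans are listed in reading order.
    return text, {"entities": sorted(spans)}


example = to_offset_example(
    "show me the trains from Delhi to pune",
    {"Delhi": "SOURCE", "pune": "DESTINATION"},
)
print(example)
```

Offsets are end-exclusive, so `text[start:end]` gives back the entity string; checking that for every example is a cheap way to catch misaligned annotations before training.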

Please find the training data in the attachment below.

data.txt

Below is my config file.

language: "en"

pipeline:
  - name: "SpacyNLP"
  - name: "SpacyTokenizer"
  #- name: "SpacyFeaturizer"
  #- name: "RegexFeaturizer"
  #- name: "CRFEntityExtractor"
  #- name: "EntitySynonymMapper"
  #- name: "SklearnIntentClassifier"
  - name: "ner_crf"
  - name: "ner_spacy"

--

vcidst pushed a commit that referenced this issue Nov 29, 2024
* use json.dump and json.load in count_vectors_featurizer and lexical_syntactic_featurizer instead of pickle

* update load and persist in sklearn intent classifier

* update persist and load in dietclassifier

* update load and persist in sklearn intent classifier

* use json.dump and json.load in tracker featurizers

* update persist and load of TEDPolicy

* updated unexpected intent policy persist and load of model utilities.

* save and load fake features

* rename patterns.pkl to patterns.json

* update poetry.lock

* ruff formatting

* move skops import

* add comments

* clean up save_features and load_features

* WIP: update model data saving and loading

* add tests for save and load features

* update tests for test_tracker_featurizer

* update tests for test_tracker_featurizer

* WIP: serialization of feature arrays.

* update serialization and deserialization for feature array

* remove not needed tests/utils/tensorflow/test_model_data_storage.py

* start writing tests for feature array

* update feature array tests

* update tests

* fix linting

* add changelog

* add new dependencies to .github/dependabot.yml

* fix some tests

* fix loading and saving of unexpected intent ted policy

* fix linting issue

* fix converting of features in cvf and lsf

* fix lint issues

* convert vocab in cvf

* fix linting

* update crf entity extractor

* fix to_dict of crf_token

* addressed type issues

* ruff formatting

* fix typing and lint issues

* remove cloudpickle dependency

* update logistic_regression_classifier and remove joblib as dependency

* update formatting of pyproject.toml

* next try: update formatting of pyproject.toml

* update logging

* update poetry.lock

* refactor loading of lexical_syntactic_featurizer

* rename FeatureMetadata.type -> FeatureMetadata.data_type

* clean up tests test_features.py and test_crf_entity_extractor.py

* update test_feature_array.py

* check for type when loading tracker featurizer.

* update changelog

* fix line too long

* move import of skops

* Prepared release of version 3.10.9.dev1 (#1496)

* prepared release of version 3.10.9.dev1

* update minimum model version

* Check for 'step_id' and 'active_flow' keys in the metadata when adding 'ActionExecuted' event to flows paths stack.

* fix parsing of commands

* improve logging

* formatting

* add changelog

* fix parse commands for multi step

* [ATO-2985] - Windows model loading test (#1537)

* Add test for model loading on windows

* Improve the error message logged when handling the user message

* Add a changelog

* Fix Code Quality - line too long

* Rasa-sdk-update (#1546)

* all rasa-sdk micro updates

* update poetry lock

* update rasa-sdk in lock file

* Remove trailing white sapce

* Prepared release of version 3.10.11 (#1570)

* prepared release of version 3.10.11

* add comments again in pyproject.toml

* update poetry.lock

* revert changes in github workflows

* undo changes in pyproject.toml

* update changelog

* revert changes in github workflows

* update poetry.lock

* update poetry.lock
tabergma added a commit that referenced this issue Dec 20, 2024
tabergma added a commit that referenced this issue Jan 10, 2025
* Update slack release notification step

* [ENG-1424] Use `pickle` alternatives (#1453)


* update pyproject.toml

* update poetry.lock

* update setuptools = '>=65.5.1,<75.6.0'

* update setuptools = '~75.3.0'

* reformat code

* undo deleting of ping_slack_about_package_release.sh

* fix formatting and type issues

* downgrade setuptools to 70.3.0

* fixing logging issues (?)

---------

Co-authored-by: sancharigr <[email protected]>