Sample functions for NLP using the CNTK API of MarkLogic 10.
The following functions are implemented in XQuery. They use a word2vec ONNX model trained on the Japanese Wikipedia; this sample model covers about 350,000 Japanese words.
This sample calculates the cosine distance between two words learned by word2vec; a conceptual sketch of the computation follows the function list below.
The following MarkLogic CNTK functions are mainly used.
- cntk:one-hot-op
- cntk:embedding-layer
- cntk:cosine-distance
- cntk:batch-of-sequences
- cntk:evaluate
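For reference, the following Python/NumPy sketch shows the computation that the XQuery sample performs with cntk:one-hot-op, cntk:embedding-layer, and cntk:cosine-distance: look up the two word vectors in the embedding matrix and take their normalized dot product. The vocabulary, the 300-dimensional embedding matrix, and the example words are hypothetical placeholders, not the actual sample data.

```python
import numpy as np

# Hypothetical stand-ins for the vocabulary and embedding matrix that the
# XQuery sample reads from /vocab/ and wikipedia_w2v_model.onnx.
vocab = {"猫": 0, "犬": 1, "経済": 2}           # word -> index
embeddings = np.random.rand(len(vocab), 300)    # one vector per word (assumed 300-dim)

def cosine_distance(word_a, word_b):
    """Cosine similarity between two word vectors."""
    # Embedding lookup: the XQuery sample does this with cntk:one-hot-op
    # followed by cntk:embedding-layer.
    a = embeddings[vocab[word_a]]
    b = embeddings[vocab[word_b]]
    # cntk:cosine-distance corresponds to this normalized dot product.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_distance("猫", "犬"))
```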
This sample gets the top-k values from an input array; a conceptual sketch follows the function list below.
The following MarkLogic CNTK functions are mainly used.
- cntk:top-k
- cntk:evaluate
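As a rough illustration of what the cntk:top-k sample computes, the NumPy sketch below returns the k largest values of an array together with their indices. The input array and k are arbitrary example values.

```python
import numpy as np

def top_k(values, k):
    """Return the k largest values and their indices, largest first."""
    values = np.asarray(values)
    idx = np.argsort(values)[::-1][:k]    # indices of the k largest entries
    return values[idx], idx

vals, idx = top_k([0.1, 0.9, 0.4, 0.7], k=2)
print(vals, idx)    # -> [0.9 0.7] [1 3]
```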
This sample searches for words similar to a given word among those learned by word2vec; a conceptual sketch follows the function list below.
The following MarkLogic CNTK functions are mainly used.
- cntk:one-hot-op
- cntk:embedding-layer
- cntk:cosine-distance
- cntk:batch-of-sequences
- cntk:evaluate
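Conceptually, the similar-word search scores every vocabulary word by cosine similarity against the query word's vector and keeps the best matches; the real sample does this scoring through the cntk: functions listed above. The NumPy sketch below shows the idea with hypothetical placeholder data.

```python
import numpy as np

vocab = ["猫", "犬", "東京", "経済"]            # hypothetical vocabulary
embeddings = np.random.rand(len(vocab), 300)    # hypothetical word2vec vectors

def most_similar(word, k=3):
    """Rank every vocabulary word by cosine similarity to the query word."""
    q = embeddings[vocab.index(word)]
    scores = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
    order = np.argsort(scores)[::-1]                     # best matches first
    return [(vocab[i], float(scores[i])) for i in order if vocab[i] != word][:k]

print(most_similar("猫"))
```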
This sample performs word analogy (for example, "A is to B as C is to ?") over the word2vec vectors; a conceptual sketch follows the function list below.
The following MarkLogic CNTK functions are mainly used.
- cntk:one-hot-op
- cntk:embedding-layer
- cntk:minus
- cntk:plus
- cntk:sqrt
- cntk:element-divide
- cntk:element-times
- cntk:reduce-sum-on-axes
- cntk:evaluate
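The word analogy follows the usual word2vec vector arithmetic: for "A is to B as C is to ?", form vec(B) - vec(A) + vec(C) and look for the words nearest to the result. The XQuery sample builds this arithmetic and the similarity ranking from cntk:minus, cntk:plus, and the element-wise and reduction operators listed above; the NumPy sketch below shows the same idea with hypothetical placeholder data.

```python
import numpy as np

vocab = ["王", "男", "女", "女王"]              # hypothetical vocabulary
embeddings = np.random.rand(len(vocab), 300)    # hypothetical word2vec vectors

def analogy(a, b, c, k=3):
    """Solve 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    vec = lambda w: embeddings[vocab.index(w)]
    target = vec(b) - vec(a) + vec(c)
    # Rank all vocabulary words by cosine similarity to the target vector.
    scores = embeddings @ target / (np.linalg.norm(embeddings, axis=1)
                                    * np.linalg.norm(target))
    order = np.argsort(scores)[::-1]
    return [(vocab[i], float(scores[i])) for i in order if vocab[i] not in (a, b, c)][:k]

print(analogy("男", "王", "女"))    # with real vectors, 女王 (queen) should rank highly
```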
The sample code above uses the following word2vec ONNX model and vocabulary list.
- wikipedia_w2v_model.onnx
  This is a sample ONNX model generated using Keras with CNTK. It was trained on about 350,000 Japanese words from Wikipedia.
- wikipedia_vocab.csv
  This is the list of word indices. Load this file into MarkLogic in JSON format using MLCP.
- Load wikipedia_vocab.csv into MarkLogic
  Convert this file to JSON format and load it under /vocab/. See load_vocab.sh for the MLCP command. A sketch of the CSV-to-JSON conversion follows this list.
- Load wikipedia_w2v_model.onnx
  Load this file into MarkLogic under /model/. See load_model.sh for the MLCP command.
- Execute the XQuery samples
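For the CSV-to-JSON conversion mentioned in the first step, a minimal Python sketch is shown below. It assumes one "word,index" pair per line in wikipedia_vocab.csv and the field names "word" and "index"; the actual format and the conversion performed for load_vocab.sh may differ, and depending on its options MLCP can also convert delimited text to JSON directly.

```python
import csv
import json
from pathlib import Path

# Assumed input format: one "word,index" pair per line in wikipedia_vocab.csv.
out_dir = Path("vocab_json")
out_dir.mkdir(exist_ok=True)

with open("wikipedia_vocab.csv", encoding="utf-8", newline="") as f:
    for word, index in csv.reader(f):
        doc = {"word": word, "index": int(index)}   # hypothetical document shape
        # One JSON document per word; MLCP then loads these under /vocab/.
        (out_dir / f"{index}.json").write_text(
            json.dumps(doc, ensure_ascii=False), encoding="utf-8")
```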