How to use jebish7/snowflake-arctic-embed-m-long_MNR_half with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("jebish7/snowflake-arctic-embed-m-long_MNR_half", trust_remote_code=True)
sentences = [
"According to the Client Money Auditor's Report, how did the Authorised Person manage Client Money—was it pooled in a single client Account or segregated into individual Client Accounts as per COBS Chapter 14?",
"The written notice in Rule 6.2.1(a)(i) must make it explicit that, if an Employee is prohibited from undertaking a Personal Account Transaction, he must not, except in the proper course of his employment:\n(a)\tprocure another Person to enter into such a Transaction; or\n(b)\tcommunicate any information or opinion to another Person if he knows, or ought to know, that the Person will as a result, enter into such a Transaction or procure some other Person to do so.",
"Client Money Auditor's Report:An Authorised Person must, in procuring the production of a Client Money Auditor's Report, ensure that an Auditor states, as at the date of which the Authorised Person's audited statement of financial position was prepared:\n(1)\tthe amount of Client Money an Authorised Person was holding and controlling in accordance with COBS Chapter 14; and\n(2)\twhether:\n(a)\tthe Authorised Person has maintained throughout the year systems and controls to enable it to comply with the relevant provisions of COBS Chapter 14;\n(b)\tthe Authorised Person's controls are such as to ensure that Client Money is identifiable and secure at all times;\n(c)\tany of the requirements in COBS Chapter 14 have not been met;\n(d)\tClient Money has been pooled in a single client Account or segregated in Client Accounts maintained for individual Clients in accordance with COBS Chapter 14;\n(e)\tif applicable, the Authorised Person as holding and controlling the appropriate amount of Client Money in accordance with COBS Chapter 14 as at the date on which the Authorised Person's audited statement of financial position was prepared;\n(f)\tthe Auditor has received all necessary information and explanations for the purposes of preparing the report to the Regulator; and\n(g)\tif applicable, there have been any material discrepancies in the reconciliation of Client Money.",
"CRS Options\n/Table Start\nNo.\tOPTIONS\tCOMMENTS\n1.\tAlternative approach to calculating account balances\tNO\n2.\tUse of other reporting period\tNO\n3.\tFiling deadlines\t30th June\n4.\tFiling Nil returns\tYES\n5.\tAllowing third party service providers to fulfil the obligations on behalf of\nthe Financial Institutions\tYES\n6.\tAllowing the due diligence procedures for New Accounts to be used for\nPre-existing Accounts\tYES\n7.\tAllowing the due diligence procedures for High Value Accounts to be used\nfor Lower Value Accounts\tYES\n8.\tResidence address test for Lower Value Accounts\tYES\n9.\tExclusion from Due Diligence for Pre-existing Entity Accounts not exceeding $250,000\tYES\n10.\tAlternative documentation procedure for certain employer-sponsored\ngroup insurance contracts or annuity contracts\tYES\n11.\tAllowing Financial Institutions to make greater use of existing\nstandardised industry coding systems for the due diligence process\tYES\n12.\tCurrency translation\tUSE USD$\n\n13.\tAllow a Financial Institution to treat certain New Accounts held by pre-existing customers as a Pre-existing Account for due diligence purposes\tYES\n14.\tExpanded definition of Related Entity for Investment Entities\tYES\n15.\tGrandfathering rule for bearer shares issued by Exempt Collective\nInvestment Vehicle\tRemoved\n16.\tPhasing in the requirements to report gross proceeds\tNO\n/Table End\n\n"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m-long on the csv dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NomicBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
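
The last two modules above (CLS-token pooling followed by `Normalize()`) can be illustrated with a minimal sketch. It uses plain Python lists and toy 2-dimensional vectors instead of the real 768-dimensional token embeddings, and the helper names `cls_pool` and `l2_normalize` are hypothetical, not part of the library:

```python
import math

def cls_pool(token_embeddings):
    """CLS pooling: keep only the first token's vector of one sequence."""
    return token_embeddings[0]

def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

# Toy token embeddings for one sequence: 3 tokens, 2 dimensions each.
token_embeddings = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)  # [0.6, 0.8]
```

Because the output has unit length, the dot product of two such embeddings equals their cosine similarity.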
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
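# trust_remote_code=True is needed because the NomicBertModel code is loaded from the Hub
model = SentenceTransformer("jebish7/snowflake-arctic-embed-m-long_MNR_half", trust_remote_code=True)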
# Run inference
sentences = [
'How should a Relevant Person ensure and demonstrate compliance with both UNSC Sanctions and U.A.E.-administered Sanctions, specifically Targeted Financial Sanctions, within the ADGM jurisdiction?',
'Where a Relevant Person seeks to rely on a Person in (1) it may only do so if and to the extent that:\n(a)\tit immediately obtains the necessary CDD information from the third party in (1);\n(b)\tit takes adequate steps to satisfy itself that certified copies of the documents used to undertake the relevant elements of CDD will be available from the third party on request without delay;\n(c)\tthe Person in (1)(b) to (d) is subject to regulation, including AML/TFS compliance requirements, by a Non-ADGM Financial Services Regulator or other competent authority in a country with AML/TFS regulations which are equivalent to the standards set out in the FATF Recommendations and it is supervised for compliance with such regulations;\n(d)\tthe Person in (1) has not relied on any exception from the requirement to conduct any relevant elements of CDD which the Relevant Person seeks to rely on; and\n(e)\tin relation to (2), the information is up to date.',
'REGULATORY REQUIREMENTS - SPOT COMMODITY ACTIVITIES\nRIEs operating an MTF or OTF using Accepted Spot Commodities\nAuthorised Persons that are operating an MTF or OTF wishing to also operate a RIE will be required to relinquish their FSP upon obtaining a Recognition Order (to operate the RIE). If licensed by the FSRA to carry out both Regulated Activities (e.g., operating an MTF and operating an RIE), the Recognition Order will include a stipulation to that effect pursuant to MIR Rule 3.4.1.\n',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
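
For semantic search, the similarity scores are typically used to rank passages against a question. The following is a minimal sketch of that ranking step, assuming the embeddings are already unit-length (as the `Normalize()` module produces), so a dot product gives the cosine similarity. The toy 2-dimensional vectors and the helper `rank_passages` are hypothetical stand-ins for real model output:

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_passages(query_embedding, passage_embeddings):
    """Return (passage_index, score) pairs sorted from most to least similar."""
    scores = [dot(query_embedding, p) for p in passage_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]

# Toy unit-length embeddings standing in for model.encode(...) output.
query = [1.0, 0.0]
passages = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]

ranking = rank_passages(query, passages)
print(ranking)  # [(2, 1.0), (1, 0.6), (0, 0.0)]
```

With real embeddings, the question would be encoded once and compared against all candidate passages in the same way.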
Columns: Question and positive

| | Question | positive |
|---|---|---|
| type | string | string |
| details | | |

| Question | positive |
|---|---|
| Regarding the assessment of the nature, ownership, and control structure of a customer, could you clarify the level of detail and due diligence expected from a Relevant Person to ensure adherence to regulatory standards? | The risk assessment under Rule 6.2.1(c) should identify actions to mitigate risks associated with undertaking NFTF business generally, and the use of eKYC specifically. This is because distinct risks are often likely to arise where business is conducted entirely in an NFTF manner, compared to when the business relationship includes a mix of face-to-face and NFTF interactions. The assessment should make reference to risk mitigation measures recommended by the Regulator, a competent authority of the U.A.E., FATF, and other relevant bodies. |
| What specific factors should be included in our risk assessment methodology to ensure it aligns with the expectations outlined in Chapter 6 and Chapter 7 of the AML Rulebook? | The RBA should not be seen as a "tick-box" approach to AML/TFS. Instead a Relevant Person is required to assess relevant money laundering risks and adopt a proportionate response to such risks, however, even where a customer is assessed through the RBA as being low-risk a minimum of simplified CDD must be undertaken in relation to that customer. |
| In the event of an investigation by the ADGM, what are the specific obligations and limitations regarding confidentiality for the entity under investigation, and what kind of disclosures are permissible under sections 197 and 198 of FSMR? | We will not normally make public the fact that we are investigating a matter. We also expect that the person who is the subject of an investigation will treat the matter as confidential. However, subject to the restrictions on disclosure of confidential information in sections 197 and 198 of FSMR, this does not stop the person under investigation from seeking professional advice or making their own enquiries into the matter, giving their auditors appropriate details of the matter or making notifications required by law. |

MultipleNegativesRankingLoss with these parameters:
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
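
To make the two parameters concrete, here is a minimal pure-Python sketch of how a multiple-negatives ranking loss is commonly computed: each question is scored against every positive in the batch with cosine similarity, the scores are multiplied by `scale`, and a cross-entropy loss treats the question's own positive as the correct class while the other positives act as in-batch negatives. The function names and toy vectors are hypothetical; this is an illustration of the idea, not the library's implementation:

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(question_embeddings, positive_embeddings, scale=20.0):
    """Mean cross-entropy where positive i is the target for question i."""
    losses = []
    for i, question in enumerate(question_embeddings):
        logits = [scale * cos_sim(question, p) for p in positive_embeddings]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two (question, positive) pairs in 2 dimensions.
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned_positives = [[1.0, 0.0], [0.0, 1.0]]
swapped_positives = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = mnr_loss(questions, aligned_positives)
swapped_loss = mnr_loss(questions, swapped_positives)
print(aligned_loss, swapped_loss)  # close to 0 when aligned, close to 20 when swapped
```

A larger `scale` sharpens the softmax, so well-separated pairs contribute almost no loss while confusable in-batch negatives are penalised more strongly.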
Non-default hyperparameters:

- per_device_train_batch_size: 4
- learning_rate: 2e-05
- num_train_epochs: 1
- warmup_ratio: 0.1
- batch_sampler: no_duplicates

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 4
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 1
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional

| Epoch | Step | Training Loss |
|---|---|---|
| 0.0271 | 100 | 0.6411 |
| 0.0541 | 200 | 0.3289 |
| 0.0812 | 300 | 0.2395 |
| 0.1083 | 400 | 0.2711 |
| 0.1354 | 500 | 0.2746 |
| 0.1624 | 600 | 0.2602 |
| 0.1895 | 700 | 0.285 |
| 0.2166 | 800 | 0.2965 |
| 0.2436 | 900 | 0.2772 |
| 0.2707 | 1000 | 0.3043 |
| 0.2978 | 1100 | 0.3059 |
| 0.3249 | 1200 | 0.316 |
| 0.3519 | 1300 | 0.2765 |
| 0.3790 | 1400 | 0.249 |
| 0.4061 | 1500 | 0.2601 |
| 0.4331 | 1600 | 0.2538 |
| 0.4602 | 1700 | 0.2443 |
| 0.4873 | 1800 | 0.2151 |
| 0.5143 | 1900 | 0.2335 |
| 0.5414 | 2000 | 0.2611 |
| 0.5685 | 2100 | 0.2557 |
| 0.5956 | 2200 | 0.2793 |
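
The `learning_rate`, `warmup_ratio` and `lr_scheduler_type: linear` settings together describe a learning rate that ramps up linearly and then decays linearly to zero. The sketch below shows that shape under stated assumptions: the helper `linear_schedule_lr` is hypothetical, and the total step count is only an estimate derived from the last logged row (step 2200 at epoch 0.5956), since the card does not state it:

```python
import math

def linear_schedule_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: shrink linearly from base_lr to 0 at the final step.
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

# Assumption: steps per epoch estimated from the logged (epoch, step) pair.
estimated_total_steps = round(2200 / 0.5956)

for step in (0, 100, 1000, 2200):
    print(step, linear_schedule_lr(step, estimated_total_steps))
```

Under this estimate, warmup covers roughly the first tenth of training, which includes the first few logged rows of the table above.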
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model
Snowflake/snowflake-arctic-embed-m-long