A small example of using an embedding extractor to align a parallel corpus

Published: 2023-12-08 17:38:34 | Author: 绝不原创的飞龙
from sentence_transformers import SentenceTransformer 
import numpy as np
from os import path

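# Use the local copy of the model if it exists; otherwise load it by name from the Hugging Face Hub.
model_path = (
    '/data/m3e-base'
    if path.isdir('/data/m3e-base')
    else 'moka-ai/m3e-base'
)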
model = SentenceTransformer(model_path)

zh_list = [
    "国际高等教育研究机构QS Quacquarelli Symonds于2023年6月28日正式发布第20版世界大学排名,首次将就业能力和可持续发展指标纳入排名体系,成为全球唯一一个同时包含这两项指标的排名。",
    "瑞典皇家科学院2022年10月10日在斯德哥尔摩宣布,将2022年诺贝尔经济学奖授予经济学家本·伯南克(Ben Bernanke)、道格拉斯·戴蒙德(Douglas Diamond)和菲利普·迪布维格(Philip Dybvig),以表彰他们在银行与金融危机研究领域的突出贡献。",
    "2022年11月10日,《福布斯》发布2022中国内地富豪榜。本次上榜者的财富总额从去年的1.48万亿美元下降至9,071亿美元,跌幅达到39%,并创下了《福布斯》调查中国内地富豪20多年以来的最大跌幅。",
    "新能源是指传统能源之外的各种能源形式。它的各种形式都是直接或者间接地来自于太阳或地球内部所产生的热能。包括太阳能、风能、生物质能、地热能、水能和海洋能以及由可再生能源衍生出来的生物燃料和氢所产生的能量。也可以说,新能源包括各种可再生能源和核能。相对于传统能源,新能源普遍具有污染少、储量大的特点,对于解决当今世界严重的环境污染问题和资源(特别是化石能源)枯竭问题具有重要意义。",
    "费曼学习法可以简化为四个单词:Concept (概念)、Teach (教给别人)、Review (评价)、Simplify (简化)。  费曼学习法的灵感源于诺贝尔物理奖获得者理查德•费曼(Richard Feynman),运用费曼技巧,你只需花上20分钟就能深入理解知识点,而且记忆深刻,难以遗忘。知识有两种类型,我们绝大多数人关注的都是错误的那类。第一类知识注重了解某个事物的名称。第二类知识注重了解某件事物。这可不是一回事儿。著名的诺贝尔物理学家理查德·费曼(Richard Feynman)能够理解这二者间的差别,这也是他成功最重要的原因之一。事实上,他创造了一种学习方法,确保他会比别人对事物了解的更透彻。",
] 
en_list = [
    "On November 10th, 2022, Forbes published the 2022 China Mainland Rich List. The total wealth of the people on this list dropped from $1.48 trillion last year to $907.1 billion, a drop of 39%, which was the biggest drop since Forbes surveyed the richest people in mainland China for more than 20 years. " ,
    "New energy refers to various forms of energy other than traditional energy. All its forms come directly or indirectly from the heat energy generated by the sun or the earth. Including solar energy, wind energy, biomass energy, geothermal energy, water energy and ocean energy, as well as energy generated by biofuels and hydrogen derived from renewable energy. It can also be said that new energy includes all kinds of renewable energy and nuclear energy. Compared with traditional energy sources, new energy sources generally have the characteristics of less pollution and large reserves, which is of great significance to solve the serious environmental pollution problem and the depletion of resources (especially fossil energy) in the world today. " ,
    "QS Quacquarelli Symonds, an international higher education research institution, officially released the 20th edition of the World University Rankings on June 28th, 2023, which brought employability and sustainable development indicators into the ranking system for the first time, becoming the only ranking in the world that includes both indicators." ,
    "Feynman learning method can be simplified to four words: Concept, Teach, Review and Simplify. Feynman's learning method is inspired by Richard Feynman, the Nobel Prize winner in physics. With Feynman's skills, you can understand the knowledge points in depth in just 20 minutes, and it is memorable and hard to forget. There are two types of knowledge, and most of us pay attention to the wrong kind. The first kind of knowledge focuses on knowing the name of something. The second kind of knowledge focuses on understanding something. This is not the same thing. Richard Feynman, a famous Nobel physicist, can understand the difference between the two, which is one of the most important reasons for his success. In fact, he created a learning method to ensure that he would know things better than others. " ,
    "The Royal Swedish Academy of Sciences announced in Stockholm on October 10th, 2022 that it would award the 2022 Nobel Prize in Economics to economists Ben Bernanke, Douglas Diamond and Philip Dybvig in recognition of their outstanding contributions in the field of banking and financial crisis research." ,
]

# Encode each list into an embedding matrix with one row per sentence.
zh_vecs = model.encode(zh_list)
en_vecs = model.encode(en_list)

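def l2_norm(arr, axis=-1):
    # Euclidean (L2) norm along `axis`; keepdims=True lets the result broadcast against `arr`.
    return (arr ** 2).sum(axis=axis, keepdims=True) ** 0.5

# Scale every embedding to unit length so that a dot product equals cosine similarity.
en_vecs /= l2_norm(en_vecs)
zh_vecs /= l2_norm(zh_vecs)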

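# sim_mat[i, j] is the cosine similarity between en_list[i] and zh_list[j].
sim_mat = en_vecs @ zh_vecs.T
# Sort each row in descending order, keeping both the similarities and the zh indices.
sims = np.sort(sim_mat, axis=-1)[:, ::-1]
idcs = np.argsort(sim_mat, axis=-1)[:, ::-1]

# Keep only the best (top-1) Chinese match for each English sentence.
idcs_top1 = idcs[:, 0].ravel()
sims_top1 = sims[:, 0].ravel()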

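# Print each English sentence, its best-matching Chinese sentence, and their cosine similarity (相似度).
for i, (j, sim) in enumerate(zip(idcs_top1, sims_top1)):
    print(en_list[i] + '\n' + zh_list[j] + f'\n相似度:{sim}\n' + '=' * 30)

# Output of the loop above:
'''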
On November 10th, 2022, Forbes published the 2022 China Mainland Rich List. The total wealth of the people on this list dropped from $1.48 trillion last year to $907.1 billion, a drop of 39%, which was the biggest drop since Forbes surveyed the richest people in mainland China for more than 20 years.
2022年11月10日,《福布斯》发布2022中国内地富豪榜。本次上榜者的财富总额从去年的1.48万亿美元下降至9,071亿美元,跌幅达到39%,并创下了《福布斯》调查中国内地富豪20多年以来的最大跌幅。
相似度:0.7973945736885071
==============================
New energy refers to various forms of energy other than traditional energy. All its forms come directly or indirectly from the heat energy generated by the sun or the earth. Including solar energy, wind energy, biomass energy, geothermal energy, water energy and ocean energy, as well as energy generated by biofuels and hydrogen derived from renewable energy. It can also be said that new energy includes all kinds of renewable energy and nuclear energy. Compared with traditional energy sources, new energy sources generally have the characteristics of less pollution and large reserves, which is of great significance to solve the serious environmental pollution problem and the depletion of resources (especially fossil energy) in the world today.
新能源是指传统能源之外的各种能源形式。它的各种形式都是直接或者间接地来自于太阳或地球内部所产生的热能。包括太阳能、风能、生物质能、地热能、水能和海洋能以及由可再生能源衍生出来的生物燃料和氢所产生的能量。也可以说,新能源包括各种可再生能源和核能。相对于传统能源,新能源普遍具有污染少、储量大的特点,对于解决当今世界严重的环境污染问题和资源(特别是化石能源)枯竭问题具有重要意义。
相似度:0.8789420127868652
==============================
QS Quacquarelli Symonds, an international higher education research institution, officially released the 20th edition of the World University Rankings on June 28th, 2023, which brought employability and sustainable development indicators into the ranking system for the first time, becoming the only ranking in the world that includes both indicators.
国际高等教育研究机构QS Quacquarelli Symonds于2023年6月28日正式发布第20版世界大学排名,首次将就业能力和可持续发展指标纳入排名体系,成为全球唯一一个同时包含这两项指标的排名。
相似度:0.8807516098022461
==============================
Feynman learning method can be simplified to four words: Concept, Teach, Review and Simplify. Feynman's learning method is inspired by Richard Feynman, the Nobel Prize winner in physics. With Feynman's skills, you can understand the knowledge points in depth in just 20 minutes, and it is memorable and hard to forget. There are two types of knowledge, and most of us pay attention to the wrong kind. The first kind of knowledge focuses on knowing the name of something. The second kind of knowledge focuses on understanding something. This is not the same thing. Richard Feynman, a famous Nobel physicist, can understand the difference between the two, which is one of the most important reasons for his success. In fact, he created a learning method to ensure that he would know things better than others.
费曼学习法可以简化为四个单词:Concept (概念)、Teach (教给别人)、Review (评价)、Simplify (简化)。  费曼学习法的灵感源于诺贝尔物理奖获得者理查德•费曼(Richard Feynman),运用费曼技巧,你只需花上20分钟就能深入理解知识点,而且记忆深刻,难以遗忘。知识有两种类型,我们绝大多数人关注的都是错误的那类。第一类知识注重了解某个事物的名称。第二类知识注重了解某件事物。这可不是一回事儿。著名的诺贝尔物理学家理查德·费曼(Richard Feynman)能够理解这二者间的差别,这也是他成功最重要的原因之一。事实上,他创造了一种学习方法,确保他会比别人对事物了解的更透彻。
相似度:0.8909085988998413
==============================
The Royal Swedish Academy of Sciences announced in Stockholm on October 10th, 2022 that it would award the 2022 Nobel Prize in Economics to economists Ben Bernanke, Douglas Diamond and Philip Dybvig in recognition of their outstanding contributions in the field of banking and financial crisis research.
瑞典皇家科学院2022年10月10日在斯德哥尔摩宣布,将2022年诺贝尔经济学奖授予经济学家本·伯南克(Ben Bernanke)、道格拉斯·戴蒙德(Douglas Diamond)和菲利普·迪布维格(Philip Dybvig),以表彰他们在银行与金融危机研究领域的突出贡献。
相似度:0.8677741289138794
==============================
'''
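
The script above takes the top-1 Chinese match for every English sentence, which works when each sentence has exactly one counterpart. As a rough sketch of a stricter variant (not part of the original example), the snippet below keeps a pair only when the two sentences pick each other as best match and the similarity clears a threshold. The function name `align_top1`, the `min_sim` parameter, and the toy 2-D vectors are assumptions made for illustration; in practice the inputs would be the arrays returned by `model.encode`.

```python
import numpy as np

def align_top1(src_vecs, tgt_vecs, min_sim=0.0):
    """Return (src_idx, tgt_idx, sim) for mutually best-matching pairs.

    Hypothetical helper: the name and the min_sim threshold are
    illustrative and do not come from the original script.
    """
    # Unit-normalize the rows so the dot product is cosine similarity.
    src = src_vecs / np.linalg.norm(src_vecs, axis=-1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=-1, keepdims=True)
    sim_mat = src @ tgt.T

    best_tgt = sim_mat.argmax(axis=1)  # best target for each source row
    best_src = sim_mat.argmax(axis=0)  # best source for each target column

    pairs = []
    for i, j in enumerate(best_tgt):
        sim = float(sim_mat[i, j])
        # Keep the pair only if the choice is mutual and similar enough.
        if best_src[j] == i and sim >= min_sim:
            pairs.append((i, int(j), sim))
    return pairs

# Toy vectors standing in for sentence embeddings (assumed data).
src = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
tgt = np.array([[0.0, 3.0], [2.0, 2.0], [5.0, 0.1]])

pairs = align_top1(src, tgt)
for i, j, sim in pairs:
    print(f"src[{i}] <-> tgt[{j}]  similarity: {sim:.4f}")
```

With real embeddings, a call such as `align_top1(en_vecs, zh_vecs, min_sim=0.7)` would drop sentences that have no convincing counterpart. The value 0.7 is only a placeholder that would need tuning on real data.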