
deeplearning.ai - Natural Language Processing and Word Embeddings


Published: 2019-05-25


Andrew Ng (吴恩达)

Natural Language Processing & Word Embeddings

Introduction to Word Embeddings

Word Representation

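1-hot representation
- Words are isolated, with no relationship between them
- The inner product of any two distinct one-hot vectors is 0
- Generalizes poorly across related words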
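
To make the inner-product point concrete, here is a minimal sketch using a hypothetical four-word vocabulary (the words and helper names are illustrative, not from the course). Any two distinct one-hot vectors have inner product 0, so "apple" looks no closer to "orange" than to "king".

```python
def one_hot(index, vocab_size):
    """Return a one-hot vector (list of 0/1) with a single 1 at `index`."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy vocabulary, for illustration only.
vocab = ["apple", "orange", "king", "queen"]
word_to_index = {w: i for i, w in enumerate(vocab)}

o_apple = one_hot(word_to_index["apple"], len(vocab))
o_orange = one_hot(word_to_index["orange"], len(vocab))
o_king = one_hot(word_to_index["king"], len(vocab))

print(dot(o_apple, o_orange))  # 0: related words look unrelated
print(dot(o_apple, o_king))    # 0: same as for an unrelated word
print(dot(o_apple, o_apple))   # 1: a word only matches itself
```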

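Featurized representation - word embedding
- A word is represented by a vector of several feature values, giving a denser vector
- The t-SNE algorithm is used for visualization: it maps high-dimensional vectors into a low-dimensional space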

Using word embeddings

Learn word embeddings from large text corpus.

Transfer embedding to new task with smaller training set.

Continue to finetune the word embeddings with new data. (optional)

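Every word in the vocabulary gets a fixed embedding: the model learns one fixed encoding per word.

A face recognition algorithm may have to deal with an enormous number of face images, whereas natural language processing works with a fixed vocabulary; words that have never been seen are treated as unknown words.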

Properties of word embeddings

Analogies using word vectors

Cosine similarity: $\mathrm{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^T \mathbf{v}}{\Vert \mathbf{u} \Vert_2 \Vert \mathbf{v} \Vert_2}$

A sufficiently large corpus is required.
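
The cosine similarity above can be used to complete an analogy such as "man is to woman as king is to ?" by looking for the word whose vector is most similar to $e_{king} - e_{man} + e_{woman}$. The sketch below assumes a tiny hand-made embedding table; the four feature values per word are illustrative numbers, not learned embeddings, and `complete_analogy` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """sim(u, v) = u.v / (||u||_2 * ||v||_2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Illustrative 4-dimensional embeddings (made-up values).
embeddings = {
    "man":    [-1.00, 0.01, 0.03, 0.09],
    "woman":  [1.00, 0.02, 0.02, 0.01],
    "king":   [-0.95, 0.93, 0.70, 0.02],
    "queen":  [0.97, 0.95, 0.69, 0.01],
    "apple":  [0.00, -0.01, 0.03, 0.95],
    "orange": [0.01, 0.00, -0.02, 0.97],
}


def complete_analogy(a, b, c, embeddings):
    """Return the word d that best satisfies a : b :: c : d."""
    target = [
        eb - ea + ec
        for ea, eb, ec in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the three query words so they cannot be returned as the answer.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))


print(complete_analogy("man", "woman", "king", embeddings))  # queen
```

With real embeddings the candidate set is the whole vocabulary, which is one reason a large training corpus matters: the analogy only works if the relevant directions were actually learned.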

Embedding matrix

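$E \cdot o_i = e_i$ extracts the embedding vector $e_i$ of a given word, where $o_i$ is that word's one-hot vector.

In practice, a dedicated lookup function is used to fetch the relevant column of the matrix directly, instead of performing an ordinary matrix multiplication.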
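
A small sketch of why the lookup is preferred: multiplying $E$ by a one-hot vector touches every entry of $E$ only to multiply almost all of them by zero, while indexing the column gives the same result directly. The matrix values and the helper names `matvec` and `lookup_column` are assumptions made for illustration.

```python
# E has one row per feature and one column per vocabulary word
# (3 features x 4 words here; values are made up).
E = [
    [-1.00, 1.00, -0.95, 0.97],
    [0.01, 0.02, 0.93, 0.95],
    [0.03, 0.02, 0.70, 0.69],
]


def matvec(matrix, vector):
    """Ordinary matrix-vector product: cost grows with rows * columns."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lookup_column(matrix, j):
    """Direct column lookup: cost grows with the number of rows only."""
    return [row[j] for row in matrix]


i = 2                      # index of the word in the vocabulary
o_i = [0.0] * len(E[0])    # one-hot vector o_i
o_i[i] = 1.0

e_by_product = matvec(E, o_i)
e_by_lookup = lookup_column(E, i)
print(e_by_product == e_by_lookup)  # True
```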

Learning Word Embeddings: Word2vec & GloVe

Learning word embeddings

Fixed historical window: only the previous n words are used to predict the next word.
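
As a rough illustration of the fixed window, the sketch below turns a sentence into (previous n words, next word) training examples. The function name and the sample sentence are assumptions for the example.

```python
def history_window_examples(tokens, n):
    """Build (previous n words, next word) pairs from a token list."""
    examples = []
    for t in range(n, len(tokens)):
        context = tuple(tokens[t - n:t])  # the n words just before position t
        examples.append((context, tokens[t]))
    return examples


sentence = "i want a glass of orange juice".split()
examples = history_window_examples(sentence, n=4)
for context, target in examples:
    print(context, "->", target)
```

Because the window length is fixed, the model input always has the same size regardless of how long the sentence is.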

Neural language model

Word2Vec

Skip-gram: predict a target word from a context word.

A single nearby word is used as the context.
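
The pairing of a context word with a nearby target word can be sketched as follows. For reproducibility this version enumerates every pair inside the window rather than sampling one target at random; the function name and window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window):
    """List every (context word, target word) pair within +/- `window` positions."""
    pairs = []
    for c, context in enumerate(tokens):
        lo = max(0, c - window)
        hi = min(len(tokens), c + window + 1)
        for t in range(lo, hi):
            if t != c:  # a word is not paired with itself
                pairs.append((context, tokens[t]))
    return pairs


tokens = "orange juice is sweet".split()
pairs = skipgram_pairs(tokens, window=1)
for context, target in pairs:
    print(context, "->", target)
```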

Hierarchical softmax classifier: speeds up classification.
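
The speed-up comes from replacing one softmax over the whole vocabulary with a sequence of binary decisions down a tree, so the cost per word grows with the tree depth rather than the vocabulary size. The sketch below is one possible formulation, assuming a perfectly balanced binary tree over 8 words stored heap-style and random parameters; practical implementations often use an unbalanced tree that places frequent words near the root.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def word_probability(word_index, e_c, node_params, vocab_size):
    """P(word | context) as a product of binary decisions along the tree path.

    Internal nodes are numbered 1..vocab_size-1 and the leaf for word k is
    node vocab_size + k, so the parent of node n is n // 2.
    """
    node = vocab_size + word_index
    prob = 1.0
    while node > 1:
        parent = node // 2
        score = sum(t * x for t, x in zip(node_params[parent], e_c))
        p_left = sigmoid(score)
        # Even-numbered nodes are left children, odd-numbered are right children.
        prob *= p_left if node % 2 == 0 else 1.0 - p_left
        node = parent
    return prob


VOCAB_SIZE = 8  # must be a power of two for this heap-style layout
DIM = 3
rng = random.Random(0)
e_c = [rng.gauss(0, 1) for _ in range(DIM)]  # context embedding (random here)
node_params = {
    n: [rng.gauss(0, 1) for _ in range(DIM)] for n in range(1, VOCAB_SIZE)
}

probs = [word_probability(k, e_c, node_params, VOCAB_SIZE) for k in range(VOCAB_SIZE)]
print(sum(probs))  # close to 1.0, up to floating-point rounding
```

Each word needs only 3 sigmoid evaluations here (the depth of the tree) instead of a normalization over all 8 words, and the leaf probabilities still form a valid distribution.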

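The sampling distribution over words is not simply uniform random sampling from the training corpus; different heuristics are used to balance common words against uncommon ones.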

CBOW (Continuous Bag-Of-Words Model) predicts the target word from its surrounding words; Skip-gram does the opposite and predicts the surrounding words from the target word.

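Negative Sampling

context word, target word, label

Randomly pick other words from the dictionary and label them as negative examples.

This turns the task into a binary classification problem.

Choosing negative examples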
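
A minimal sketch of building the (context word, target word, label) examples: one observed pair gets label 1, and k words drawn from the dictionary get label 0. The sampling weights use word frequency raised to the 3/4 power, a commonly cited heuristic that sits between uniform sampling and raw frequency; the word counts and function names below are hypothetical.

```python
import random


def sampling_distribution(counts, power=0.75):
    """P(w) proportional to count(w) ** power."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}


def make_training_examples(context, target, counts, k, rng):
    """Return one positive example (label 1) followed by k negatives (label 0)."""
    dist = sampling_distribution(counts)
    words = list(dist)
    probs = [dist[w] for w in words]
    examples = [(context, target, 1)]
    # Negatives are drawn from the whole dictionary; a draw may occasionally
    # coincide with a genuine neighbour, which this sketch does not filter out.
    negatives = rng.choices(words, weights=probs, k=k)
    examples += [(context, w, 0) for w in negatives]
    return examples


# Hypothetical word counts, for illustration only.
counts = {"the": 1000, "of": 800, "king": 10, "book": 25, "orange": 15, "juice": 12}
rng = random.Random(0)
examples = make_training_examples("orange", "juice", counts, k=4, rng=rng)
for example in examples:
    print(example)
```

Each example is then scored by a binary (logistic) classifier, so a training step updates only k + 1 classifiers rather than a softmax over the full vocabulary.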

GloVe word vectors

global vectors for word representation

$X_{ij}$ is a count that captures how often words $i$ and $j$ appear close to each other.
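
The count $X_{ij}$ can be computed with a single pass over the corpus. The sketch below assumes that "close" means within a symmetric window of a fixed number of positions, which makes $X_{ij} = X_{ji}$; the window size and the sample sentence are illustrative.

```python
from collections import Counter


def cooccurrence_counts(tokens, window):
    """X[(i, j)] = number of times word j occurs within `window` positions of word i."""
    X = Counter()
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                X[(word, tokens[other])] += 1
    return X


tokens = "the cat sat on the mat".split()
X = cooccurrence_counts(tokens, window=1)
print(X[("the", "cat")], X[("cat", "the")])  # 1 1
print(X[("cat", "mat")])                     # 0: never within one position
```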

Applications using Word Embeddings

Sentiment Classification

Averaging the word embeddings ignores word order.
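
The effect is easy to see in a small sketch: two sentences made of the same words in a different order produce the same averaged vector, so a classifier built on the average cannot tell them apart. The two-dimensional embeddings below are made-up values.

```python
import math


def average_embedding(tokens, embeddings):
    """Element-wise mean of the embeddings of all tokens."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        for d, value in enumerate(embeddings[tok]):
            total[d] += value
    return [t / len(tokens) for t in total]


# Made-up 2-dimensional embeddings, for illustration only.
embeddings = {
    "good": [1.0, 0.5],
    "bad": [-1.0, 0.25],
    "not": [0.0, -0.5],
}

a = average_embedding("good not bad".split(), embeddings)
b = average_embedding("bad not good".split(), embeddings)
same = all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(a, b))
print(same)  # True: the two sentences are indistinguishable after averaging
```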

RNN: reads the sentence word by word, so word order is taken into account.

Debiasing word embeddings

SVD (singular value decomposition)
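
One step commonly described for debiasing is neutralization: once a bias direction $g$ has been identified (for example from differences of word pairs, or with SVD), the component of a word vector along $g$ is removed. The sketch below shows only that projection step, with a hand-picked direction and a made-up embedding; it is an assumption-laden illustration, not the full procedure.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neutralize(e, g):
    """Remove from embedding e its projection onto the bias direction g."""
    scale = dot(e, g) / dot(g, g)
    return [x - scale * gi for x, gi in zip(e, g)]


# Made-up values: g stands in for an already-identified bias direction.
g = [2.0, 0.0, 1.0]
e_word = [0.3, 0.8, 0.1]

e_debiased = neutralize(e_word, g)
print(abs(dot(e_debiased, g)) < 1e-9)  # True: nothing left along g
```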
