Semantic Chunker
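"Semantic chunking" is a concept proposed by Greg Kamradt in his video tutorial on the 5 levels of embedding chunking: https://youtu.be/8OJC21T2SL4?t=1933.
Instead of chunking text with a **fixed** chunk size, the semantic splitter uses embedding similarity to adaptively pick the breakpoints between sentences. This keeps each "chunk" made up of sentences that are semantically related to one another.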
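The idea in the paragraph above can be sketched in a few lines. This is a minimal illustration, not the module's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `semantic_chunks`, `cosine_distance`, and `percentile` are names made up for this example. The sketch measures the distance between each pair of adjacent sentences and starts a new chunk wherever that distance exceeds a chosen percentile of all the distances.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (pct / 100) * (len(ordered) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(sentences: list[str], pct_threshold: float = 95) -> list[str]:
    """Group sentences, breaking where the adjacent-sentence distance exceeds the percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [toy_embed(s) for s in sentences]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    cutoff = percentile(distances, pct_threshold)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > cutoff:  # large semantic jump: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse on the mat.",
    "Stock prices fell sharply today.",
    "Stock markets fell across Asia today.",
]
chunks = semantic_chunks(sentences, pct_threshold=50)
for chunk in chunks:
    print(chunk)
```

With these four sentences the only distance above the cutoff is the jump from the cat sentences to the stock sentences, so the sketch yields two chunks. A real embedding model would capture similarity in meaning rather than shared words.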
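We adapted it into a LlamaIndex module.
Check out the notebook below!
Caveats
- The sentence-splitting regex works primarily for English sentences.
- You may need to tune the breakpoint percentile threshold.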
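The first caveat is easy to see with a small experiment. The pattern below is a hypothetical illustration, not necessarily the one the module uses: it splits after `.`, `?`, or `!` when whitespace follows. Chinese text typically uses full-width punctuation with no spaces between sentences, so a pattern of this shape finds no boundaries in it.

```python
import re

# Hypothetical pattern for illustration: split after ., ? or ! when followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.?!])\s+")

english = "I wrote simple games. Then I studied philosophy! Was it boring?"
chinese = "我写了一些简单的游戏。然后我学习了哲学!它无聊吗?"

english_sentences = SENTENCE_BOUNDARY.split(english)
chinese_sentences = SENTENCE_BOUNDARY.split(chinese)

print(len(english_sentences))  # 3
print(len(chinese_sentences))  # 1
```

If every document comes back as a single sentence, the splitter has nothing to place breakpoints between, so non-English text likely needs a different sentence splitter.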
Setup Data
%pip install llama-index-embeddings-openai
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'pg_essay.txt'
from llama_index.core import SimpleDirectoryReader
# load documents
documents = SimpleDirectoryReader(input_files=["pg_essay.txt"]).load_data()
Define Semantic Splitter
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)
# also baseline splitter
base_splitter = SentenceSplitter(chunk_size=512)
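One plausible reading of `buffer_size=1`, offered as an assumption rather than a description of the module's internals, is that each sentence is compared together with one neighboring sentence on either side, which smooths out noise from very short sentences. The hypothetical helper `build_windows` below shows what such windows would look like.

```python
def build_windows(sentences: list[str], buffer_size: int = 1) -> list[str]:
    """Join each sentence with up to `buffer_size` neighbors on each side."""
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


sentences = ["A.", "B.", "C.", "D."]
windows = build_windows(sentences, buffer_size=1)
print(windows)  # ['A. B.', 'A. B. C.', 'B. C. D.', 'C. D.']
```

Under this reading, a larger buffer makes neighboring windows overlap more, so distances between them shrink and breakpoints become less sensitive to any single sentence.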
nodes = splitter.get_nodes_from_documents(documents)
print(nodes[1].get_content())
I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep. The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights. The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer. I was puzzled by the 1401.
Chunk 2: Personal Computer + College
print(nodes[2].get_content())
I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any programs I wrote, because they can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear. With microcomputers, everything changed. Now you could have a computer sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punch cards and then stopping. [1] The first of my friends to get a microcomputer built it himself. It was sold as a kit by Heathkit. I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer. Computers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good enough. This was when I really started programming. I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter. Though I liked programming, I didn't plan to study it in college. In college I was going to study philosophy, which sounded much more powerful. 
It seemed, to my naive high school self, to be the study of the ultimate truths, compared to which the things studied in other fields would be mere domain knowledge. What I discovered when I got to college was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths. All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored. I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI. AI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world. It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDLU, it seemed like that time would be a few years at most. All you had to do was teach SHRDLU more words. There weren't any classes in AI at Cornell then, not even graduate classes, so I started trying to teach myself. Which meant learning Lisp, since in those days Lisp was regarded as the language of AI. The commonly used programming languages then were pretty primitive, and programmers' ideas correspondingly so. The default language at Cornell was a Pascal-like language called PL/I, and the situation was similar elsewhere. Learning Lisp expanded my concept of a program so fast that it was years before I started to have a sense of where the new limits were. This was more like it; this was what I had expected college to do. It wasn't happening in a class, like it was supposed to, but that was ok.
Chunk 3: Finishing up College + Grad School
print(nodes[3].get_content())
For the next couple years I was on a roll. I knew what I was going to do. For my undergraduate thesis, I reverse-engineered SHRDLU. My God did I love working on that program. It was a pleasing bit of code, but what made it even more exciting was my belief — hard to imagine now, but not unique in 1985 — that it was already climbing the lower slopes of intelligence. I had gotten into a program at Cornell that didn't make you choose a major. You could take whatever classes you liked, and choose whatever you liked to put on your degree. I of course chose "Artificial Intelligence." When I got the actual physical diploma, I was dismayed to find that the quotes had been included, which made them read as scare-quotes. At the time this bothered me, but now it seems amusingly accurate, for reasons I was about to discover. I applied to 3 grad schools: MIT and Yale, which were renowned for AI at the time, and Harvard, which I'd visited because Rich Draves went there, and was also home to Bill Woods, who'd invented the type of parser I used in my SHRDLU clone. Only Harvard accepted me, so that was where I went. I don't remember the moment it happened, or if there even was a specific moment, but during the first year of grad school I realized that AI, as practiced at the time, was a hoax. By which I mean the sort of AI in which a program that's told "the dog is sitting on the chair" translates this into some formal representation and adds it to the list of things it knows. What these programs really showed was that there's a subset of natural language that's a formal language. But a very proper subset. It was clear that there was an unbridgeable gap between what they could do and actually understanding natural language. It was not, in fact, simply a matter of teaching SHRDLU more words. That whole way of doing AI, with explicit data structures representing concepts, was not going to work. 
Its brokenness did, as so often happens, generate a lot of opportunities to write papers about various band-aids that could be applied to it, but it was never going to get us Mike. So I looked around to see what I could salvage from the wreckage of my plans, and there was Lisp. I knew from experience that Lisp was interesting for its own sake and not just for its association with AI, even though that was the main reason people cared about it at the time. So I decided to focus on Lisp. In fact, I decided to write a book about Lisp hacking. It's scary to think how little I knew about Lisp hacking when I started writing that book. But there's nothing like writing a book about something to help you learn it. The book, On Lisp, wasn't published till 1993, but I wrote much of it in grad school. Computer Science is an uneasy alliance between two halves, theory and systems. The theory people prove things, and the systems people build things. I wanted to build things. I had plenty of respect for theory — indeed, a sneaking suspicion that it was the more admirable of the two halves — but building things seemed so much more exciting. The problem with systems work, though, was that it didn't last. Any program you wrote today, no matter how good, would be obsolete in a couple decades at best. People might mention your software in footnotes, but no one would actually use it. And indeed, it would seem very feeble work. Only people with a sense of the history of the field would even realize that, in its time, it had been good. There were some surplus Xerox Dandelions floating around the computer lab at one point. Anyone who wanted one to play around with could have one. I was briefly tempted, but they were so slow by present standards; what was the point? No one else wanted one either, so off they went. That was what happened to systems work. I wanted not just to build things, but to build things that would last. 
In this dissatisfied state I went in 1988 to visit Rich Draves at CMU, where he was in grad school. One day I went to visit the Carnegie Institute, where I'd spent a lot of time as a kid. While looking at a painting there I realized something that might seem obvious, but was a big surprise to me. There, right on the wall, was something you could make that would last. Paintings didn't become obsolete. Some of the best ones were hundreds of years old. And moreover this was something you could make a living doing. Not as easily as you could by writing software, of course, but I thought if you were really industrious and lived really cheaply, it had to be possible to make enough to survive. And as an artist you could be truly independent. You wouldn't have a boss, or even need to get research funding. I had always liked looking at paintings. Could I make them?
Compare against Baseline
For comparison, let's look at what the baseline splitter with a fixed chunk size produces.
base_nodes = base_splitter.get_nodes_from_documents(documents)
print(base_nodes[2].get_content())
This was when I really started programming. I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter. Though I liked programming, I didn't plan to study it in college. In college I was going to study philosophy, which sounded much more powerful. It seemed, to my naive high school self, to be the study of the ultimate truths, compared to which the things studied in other fields would be mere domain knowledge. What I discovered when I got to college was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths. All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored. I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI. AI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world. It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDLU, it seemed like that time would be a few years at most. All you had to do was teach SHRDLU more words. There weren't any classes in AI at Cornell then, not even graduate classes, so I started trying to teach myself. Which meant learning Lisp, since in those days Lisp was regarded as the language of AI. 
The commonly used programming languages then were pretty primitive, and programmers' ideas correspondingly so. The default language at Cornell was a Pascal-like language called PL/I, and the situation was similar elsewhere.
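Reading individual chunks side by side only goes so far; a quick summary of chunk sizes makes the difference between the two splitters easier to see. The sketch below uses a hypothetical helper, `length_stats`, and stand-in strings so that it runs on its own. In the notebook you would pass `[n.get_content() for n in nodes]` and `[n.get_content() for n in base_nodes]` instead. Lengths are measured in characters here, whereas `chunk_size=512` for the baseline is expressed in tokens.

```python
from statistics import mean


def length_stats(texts: list[str]) -> dict:
    """Count chunks and summarize their lengths in characters."""
    lengths = [len(t) for t in texts]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "mean": round(mean(lengths), 1),
        "max": max(lengths),
    }


# Stand-in strings, not real notebook output.
semantic_texts = ["a" * 120, "b" * 900, "c" * 2400]
fixed_texts = ["d" * 1000, "e" * 1000, "f" * 1000, "g" * 420]

print(length_stats(semantic_texts))
print(length_stats(fixed_texts))
```

A wide spread between the minimum and maximum for the semantic chunks, against a narrow spread for the baseline, is what adaptive breakpoints would be expected to produce.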
Setup Query Engine
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import display_source_node
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine()
base_vector_index = VectorStoreIndex(base_nodes)
base_query_engine = base_vector_index.as_query_engine()
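Each query engine answers a question by retrieving the nodes whose embeddings are most similar to the query embedding and handing them to the LLM. The sketch below shows that retrieval step in isolation, with made-up two-dimensional vectors and hypothetical helper names (`cosine_similarity`, `top_k`). The outputs in this notebook show two source nodes per query, so `k=2` is used here; whether that matches the default in your installed version is an assumption worth checking.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], node_vecs: dict, k: int = 2) -> list[tuple]:
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up vectors standing in for node embeddings.
node_vecs = {
    "node-a": [1.0, 0.0],
    "node-b": [0.7, 0.7],
    "node-c": [0.0, 1.0],
}
results = top_k([1.0, 0.1], node_vecs, k=2)
print(results)
```

Because only the top few nodes reach the LLM, how the text was chunked directly determines what context each answer is built from.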
Run some Queries
response = query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)
print(str(response))
The author's programming journey began in childhood when computers were expensive and not easily accessible. They couldn't do much with computers at that time as the only form of input was data stored on punched cards, which they didn't have. They didn't know enough math to do anything interesting either. However, with the advent of microcomputers, everything changed. The author's friend built a microcomputer from a kit, which impressed and envied the author. Eventually, the author convinced their father to buy a TRS-80 computer, which marked the start of their programming journey. They wrote simple games, a program to predict rocket heights, and even a word processor. Despite their interest in programming, the author initially planned to study philosophy in college but found it boring. They then switched to studying AI, which was in the air during the mid-1980s. The author taught themselves AI since there were no classes available at Cornell at that time. They learned Lisp, which expanded their concept of programming and opened up new possibilities.
for n in response.source_nodes:
    display_source_node(n, source_length=20000)
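Node ID: 68006b95-c06e-486c-bbb6-be54746aaf22
Similarity: 0.8465522042661249
Text: I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any programs I wrote, because they can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear.
With microcomputers, everything changed. Now you could have a computer sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punch cards and then stopping. [1]
The first of my friends to get a microcomputer built it himself. It was sold as a kit by Heathkit. I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer.
Computers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good enough. This was when I really started programming. I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter.
Though I liked programming, I didn't plan to study it in college. In college I was going to study philosophy, which sounded much more powerful. It seemed, to my naive high school self, to be the study of the ultimate truths, compared to which the things studied in other fields would be mere domain knowledge. What I discovered when I got to college was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths. All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored.
I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI.
AI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world. It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDLU, it seemed like that time would be a few years at most. All you had to do was teach SHRDLU more words.
There weren't any classes in AI at Cornell then, not even graduate classes, so I started trying to teach myself. Which meant learning Lisp, since in those days Lisp was regarded as the language of AI. The commonly used programming languages then were pretty primitive, and programmers' ideas correspondingly so. The default language at Cornell was a Pascal-like language called PL/I, and the situation was similar elsewhere. Learning Lisp expanded my concept of a program so fast that it was years before I started to have a sense of where the new limits were. This was more like it; this was what I had expected college to do. It wasn't happening in a class, like it was supposed to, but that was ok.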
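Node ID: a7cc0ef9-400e-47b3-a85b-fb871bfd183d
Similarity: 0.8460437724191147
Text: I had no idea. I'd never imagined it was even possible. I knew intellectually that people made art, that it didn't just appear spontaneously, but it was as if the people who made it were a different species. They either lived long ago or were mysterious geniuses doing strange things in profiles in Life magazine. The idea of actually being able to make art, to put that verb before that noun, seemed almost miraculous.
That fall I started taking art classes at Harvard. Grad students could take classes in any department, and my advisor, Tom Cheatham, was very easy going. If he even knew about the strange classes I was taking, he never said anything.
So now I was in a PhD program in computer science, yet planning to be an artist, yet also genuinely in love with Lisp hacking and working away at On Lisp. In other words, like many a grad student, I was working energetically on multiple projects that were not my thesis.
I didn't see a way out of this situation. I didn't want to drop out of grad school, but how else was I going to get out? I remember when my friend Robert Morris got kicked out of Cornell for writing the internet worm of 1988, I was envious that he'd found such a spectacular way to get out of grad school.
Then one day in April 1990 a crack appeared in the wall. I ran into professor Cheatham and he asked if I was far enough along to graduate that June. I didn't have a word of my dissertation written, but in what must have been the quickest bit of thinking in my life, I decided to take a shot at writing one in the 5 weeks or so that remained before the deadline, reusing parts of On Lisp where I could, and I was able to respond, with no perceptible delay, "Yes, I think so. I'll give you something to read in a few days."
I picked applications of continuations as the topic. In retrospect I should have written about macros and embedded languages. There's a whole world there that's barely been explored. But all I wanted was to get out of grad school, and my rapidly written dissertation sufficed, just barely.
Meanwhile I was applying to art schools. I applied to two: RISD in the US, and the Accademia di Belli Arti in Florence, which, because it was the oldest art school, I imagined would be good. RISD accepted me, and I never heard back from the Accademia, so off to Providence I went.
I'd applied for the BFA program at RISD, which meant in effect that I had to go to college again. This was not as strange as it sounds, because I was only 25, and art schools are full of people of different ages. RISD counted me as a transfer sophomore and said I had to do the foundation that summer. The foundation means the classes that everyone has to take in fundamental subjects like drawing, color, and design.
Toward the end of the summer I got a big surprise: a letter from the Accademia, which had been delayed because they'd sent it to Cambridge England instead of Cambridge Massachusetts, inviting me to take the entrance exam in Florence that fall.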
base_response = base_query_engine.query(
    "Tell me about the author's programming journey through childhood to college"
)
print(str(base_response))
The author's programming journey began in childhood when they started writing simple games and programs to predict the flight of model rockets. They also developed a word processor that their father used to write a book. Despite their interest in programming, they initially planned to study philosophy in college. However, they found philosophy courses to be boring and decided to switch to AI. At that time, there were no AI classes at Cornell, so they taught themselves by learning Lisp, which was considered the language of AI. The author's programming journey continued to evolve as they encountered new technologies, such as microcomputers, which allowed for more interactive and accessible programming experiences.
for n in base_response.source_nodes:
    display_source_node(n, source_length=20000)
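Node ID: 6c0de686-e1be-4ece-b514-7ed6f732b043
Similarity: 0.8637606779131186
Text: This was when I really started programming. I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter.
Though I liked programming, I didn't plan to study it in college. In college I was going to study philosophy, which sounded much more powerful. It seemed, to my naive high school self, to be the study of the ultimate truths, compared to which the things studied in other fields would be mere domain knowledge. What I discovered when I got to college was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths. All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored.
I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI.
AI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world. It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDLU, it seemed like that time would be a few years at most. All you had to do was teach SHRDLU more words.
There weren't any classes in AI at Cornell then, not even graduate classes, so I started trying to teach myself. Which meant learning Lisp, since in those days Lisp was regarded as the language of AI. The commonly used programming languages then were pretty primitive, and programmers' ideas correspondingly so. The default language at Cornell was a Pascal-like language called PL/I, and the situation was similar elsewhere.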
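Node ID: c5ba0780-d9d7-436e-9730-ce7fe44539c1
Similarity: 0.8571409465192146
Text: What I Worked On
February 2021
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.
The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines (CPU, disk drives, printer, card reader) sitting up on a raised floor under bright fluorescent lights.
The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer.
I was puzzled by the 1401. I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to do things that didn't rely on any input, like calculate approximations of pi, but I didn't know enough math to do anything interesting of that type. So I'm not surprised I can't remember any programs I wrote, because they can't have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn't. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager's expression made clear.
With microcomputers, everything changed. Now you could have a computer sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punch cards and then stopping.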
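Outside a notebook, `display_source_node` is less convenient, and a compact plain-text summary can make the two sets of retrieved nodes easier to compare. The sketch below is self-contained and uses a hypothetical `ScoredText` stand-in instead of real retrieved nodes; the scores and texts are abbreviated from the output above. Adapting it to `response.source_nodes` would mean reading the score and text from each returned node, and the attribute names to use there should be checked against your installed version.

```python
from dataclasses import dataclass


@dataclass
class ScoredText:
    """Hypothetical stand-in for a retrieved node: a similarity score plus its text."""

    score: float
    text: str


def summarize_sources(sources: list[ScoredText], preview_chars: int = 40) -> list[str]:
    """One line per source, best score first: score, length, and a short preview."""
    lines = []
    for src in sorted(sources, key=lambda s: s.score, reverse=True):
        preview = src.text[:preview_chars].replace("\n", " ")
        lines.append(f"{src.score:.4f} | {len(src.text)} chars | {preview}")
    return lines


# Abbreviated stand-ins for the two baseline source nodes shown above.
sources = [
    ScoredText(0.8571, "What I Worked On\nFebruary 2021"),
    ScoredText(0.8638, "This was when I really started programming."),
]
summary = summarize_sources(sources)
for line in summary:
    print(line)
```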
response = query_engine.query("Tell me about the author's experience in YC")
print(str(response))
The author had a significant experience in Y Combinator (YC). They initially did not intend for YC to be a full-time job, but as it grew, it started to take up more of their attention. They worked on various projects within YC, including selecting and helping founders, writing essays, and working on internal software. The author found the work engaging and enjoyed the opportunity to learn about startups. However, there were also parts of the job that they did not like, such as disputes between cofounders and dealing with maltreatment of startups. Despite the challenges, the author worked hard and wanted YC to be successful.
base_response = base_query_engine.query(
    "Tell me about the author's experience in YC"
)
print(str(base_response))
The author's experience in YC was different from other kinds of work they have done. Instead of deciding for themselves what to work on, the problems came to them. Every 6 months, there was a new batch of startups, and their problems became the author's problems. This work was engaging because the problems were varied, and the good founders were very effective. However, there were parts of the job that the author didn't like, such as disputes between cofounders and dealing with people who maltreated the startups. Despite this, the author worked hard even at the parts they didn't like because they wanted YC to be good.