Applying LLMs to Enterprise Data: Concepts, Concerns, and Hot-Takes

Sam Stone

Source: DreamStudio (generated by author)

Ask GPT-4 to prove there are infinite prime numbers — while rhyming — and it delivers. But ask it how your team performed vs plan last quarter, and it will fail miserably. This illustrates a fundamental challenge of large language models (“LLMs”): they have a good grasp of general, public knowledge (like prime number theory), but are entirely unaware of proprietary, non-public information (how your team did last quarter).[1] And proprietary information is critical to the vast majority of enterprise workflows. A model that understands the public internet is cute, but of little use in its raw form to most organizations.

Over the past year, I’ve had the privilege of working with a number of organizations applying LLMs to enterprise use cases. This post details key concepts and concerns that anyone embarking on such a journey should know, as well as a few hot-takes on how I think LLMs will evolve and implications for ML product strategy. It’s intended for product managers, designers, engineers and other readers with limited or no knowledge of how LLMs work “under the hood”, but some interest in learning the concepts without going into technical details.

Four Concepts

Prompt Engineering, Context Windows, and Embeddings

The simplest way to make an LLM reason about proprietary data is to provide the proprietary data in the model’s prompt. Most LLMs would have no problem answering the following correctly: “We have 2 customers, A and B, who spent $100K and $200K, respectively. Who was our largest customer and how much did they spend?” We’ve just done some basic prompt engineering, by prepending our query (the second sentence) with context (the first sentence).
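As a concrete illustration, here is a minimal sketch of that prepend-the-context step. It assumes the pre-1.0 `openai` Python SDK (circa 2023) and an API key in the environment; the context and question strings are just the toy example above.

```python
import openai  # assumes the pre-1.0 openai SDK (e.g. 0.27.x) with OPENAI_API_KEY set

context = "We have 2 customers, A and B, who spent $100K and $200K, respectively."
question = "Who was our largest customer and how much did they spend?"

# Basic prompt engineering: prepend the proprietary context to the query.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"{context}\n\n{question}"}],
)
print(response["choices"][0]["message"]["content"])
```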

But in the real world, we may have thousands or millions of customers. How do we decide which information should go into the context — considering that each word included in the context costs money? This is where embeddings come in. Embeddings are a method for transforming text into numerical vectors, such that similar text generates similar vectors (vectors that are “close together” in N-dimensional space).[2] We might embed website text, documents, maybe even an entire corpus from SharePoint, Google Docs, or Notion. Then, for each user prompt, we embed it and find the vectors from our text corpus that are most similar to our prompt vector.
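To make that workflow concrete, here is a minimal sketch of embedding a tiny corpus and ranking chunks by cosine similarity. The corpus, helper names, and thresholds are illustrative assumptions, and the calls assume the pre-1.0 `openai` SDK with its ada-002 embedding model (see footnote 3).

```python
import numpy as np
import openai  # pre-1.0 openai SDK assumed, with OPENAI_API_KEY set

corpus = [
    "Lions live in prides and hunt on the African savanna.",
    "Giraffes are the tallest browsing animals on the plains.",
    "Domestic pigs are raised on farms around the world.",
]

def embed(texts):
    # ada-002 is OpenAI's recommended embedding model (see footnote 3)
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in resp["data"]])

corpus_vectors = embed(corpus)

def top_k(query, k=2):
    # Rank corpus chunks by cosine similarity to the query vector:
    # vectors that are "close together" score near 1.0.
    q = embed([query])[0]
    scores = corpus_vectors @ q / (
        np.linalg.norm(corpus_vectors, axis=1) * np.linalg.norm(q)
    )
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

print(top_k("What animals might I see on a safari?"))
```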

For example, if we embedded Wikipedia pages on animals, and the user asked a question about safaris, our search would rank highly the Wikipedia articles about lions, zebra, and giraffes. This allows us to identify the text chunks most similar to the prompt — and thus most likely to answer it.[3] We include these most similar text chunks in the context that is prepended to the prompt, so that the prompt hopefully contains all the information necessary for the LLM to answer the question.
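Putting retrieval and prompting together, a hedged sketch of the full loop might look like the following. It reuses the hypothetical `top_k` helper and `openai` import from the embedding sketch above; the chunk count and prompt layout are arbitrary choices for illustration.

```python
# Continuing the embedding sketch above (top_k and the openai import are reused).
def answer_with_context(question, k=3):
    chunks = top_k(question, k=k)      # text chunks most similar to the prompt
    context = "\n\n".join(chunks)      # context that gets prepended to the prompt
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    return response["choices"][0]["message"]["content"]
```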

Fine-Tuning

A downside of embeddings is that every call to the LLM requires all the context to be passed with the prompt. The LLM has no “memory” of even the most basic enterprise-specific concepts. And since most cloud-based LLM providers charge per prompt token, this can get expensive fast.[4]

Fine-tuning allows an LLM to understand enterprise-specific concepts without including them in each prompt. We take a foundation model, which already encodes general knowledge across billions of learned parameters, and tweak those parameters to reflect specific enterprise knowledge, while still retaining the underlying general knowledge.[5] When we generate inferences with the new fine-tuned model, we get that enterprise knowledge “for free”.

In contrast to embeddings/prompt engineering, where the underlying model is a third-party black box, fine-tuning is closer to classical machine learning, where ML teams created their own models from scratch. Fine-tuning requires a training dataset with labeled observations; the fine-tuned model is highly sensitive to the quality and volume of that training data. We also need to make configuration decisions (number of epochs, learning rate, etc.), orchestrate long-running training jobs, and track model versions. Some foundation model providers offer APIs that abstract away some of this complexity; others do not.
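A rough sketch of that workflow is below, using OpenAI's legacy fine-tuning endpoints from the pre-1.0 SDK as an example (those endpoints and parameters have since changed, and other providers differ); the training example and settings are illustrative only.

```python
import json
import openai  # pre-1.0 openai SDK assumed; legacy fine-tuning endpoints shown for illustration

# 1. A labeled training dataset: prompt/completion pairs. Quality and volume matter a lot.
examples = [
    {"prompt": "Who was our largest customer last quarter?\n\n###\n\n",
     "completion": " Customer B, who spent $200K.\n"},
    # ...many more labeled observations
]
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# 2. Upload the file and kick off a long-running training job.
training_file = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = openai.FineTune.create(
    training_file=training_file["id"],
    model="davinci",   # only lagging-edge base models were fine-tunable at the time
    n_epochs=4,        # configuration decisions: epochs, learning rate, etc.
)
print(job["id"])       # track the job and the resulting model version
```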

While inferences may be cheaper with fine-tuned models, that can be outweighed by costly training jobs.[6] And some foundation model providers (like OpenAI) only support fine-tuning of lagging-edge models (so not ChatGPT or GPT-4).

Evals

One of the novel, significant challenges presented by LLMs is measuring the quality of complex outputs. Classical ML teams have tried-and-true methods for measuring the accuracy of simple outputs, like numerical predictions or categorizations. But most enterprise use cases for LLMs involve generating responses that are tens to thousands of words. Concepts sophisticated enough to require more than ten words can normally be worded in many ways. So even if we have a human-validated “expert” response, doing an exact string match of a model response to the expert response is too stringent a test, and would underestimate model response quality.

The Evals framework, open-sourced by OpenAI, is one approach to tackling this problem. This framework requires a labeled test set (where prompts are matched to “expert” responses), but it allows broad types of comparison between model and expert responses. For example, is the model-generated answer: a subset or superset of the expert answer; factually equivalent to the expert answer; more or less concise than the expert answer? The caveat is that Evals perform these checks using an LLM. If there’s a flaw in the “checker” LLM, the eval results may themselves be inaccurate.
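As a rough sketch of the idea (not the Evals framework's actual API), a model-graded check can be as simple as asking a "checker" LLM to compare the two responses; the grading prompt and helper name here are my own illustrative choices.

```python
import openai  # pre-1.0 openai SDK assumed; mimics a model-graded check, not the Evals API itself

GRADER_PROMPT = """You are grading a model answer against an expert answer.
Question: {question}
Expert answer: {expert}
Model answer: {model}
Is the model answer factually equivalent to the expert answer?
Reply YES or NO, then one sentence of justification."""

def grade(question, expert, model_answer):
    # The "checker" is itself an LLM; if it is flawed, the eval results may be too.
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, expert=expert, model=model_answer)}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]
```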

Adversarial Examples

If you’re using an LLM in production, you need to have confidence that it will handle misguided or malicious user inputs safely. For most enterprises, the starting point is ensuring the model doesn’t spread false information. That means a system that knows its own limitations and when to say “I don’t know.” There are many tactical approaches here. It can be done via prompt engineering with prompt language like “Respond ‘I don’t know’ if the question cannot be answered with the context provided above”. It can be done with fine-tuning, by providing out-of-scope training examples, where the expert response is “I don’t know”.
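Here is a minimal sketch of the prompt-engineering version of that guardrail; the exact wording, helper name, and model choice are illustrative assumptions, again using the pre-1.0 `openai` SDK.

```python
import openai  # pre-1.0 openai SDK assumed

SYSTEM_PROMPT = (
    "Answer using only the context provided by the user. "
    "Respond 'I don't know' if the question cannot be answered with the provided context."
)

def guarded_answer(context, question):
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0,  # keep the refusal behavior as deterministic as possible
    )
    return resp["choices"][0]["message"]["content"]
```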

Enterprises also need to guard against malicious user inputs, e.g. prompt hacking. Limiting the format and length of the system’s acceptable inputs and outputs can be an easy and effective start. Precautions are a good idea if you’re only serving internal users and they’re essential if you’re serving external users.
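One simple starting point, sketched below with arbitrary thresholds, is to validate inputs before they ever reach the model and to cap output length on the completion call; real limits and allowed character sets depend on your use case.

```python
import re

# Illustrative thresholds; tune them for your own use case.
MAX_INPUT_CHARS = 2_000
MAX_OUTPUT_TOKENS = 500
ALLOWED_INPUT = re.compile(r"^[\w\s.,;:?!$%()'\"-]+$")  # reject unusual characters and markup

def validate_input(user_text: str) -> str:
    if len(user_text) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    if not ALLOWED_INPUT.match(user_text):
        raise ValueError("input contains disallowed characters")
    return user_text

# When calling the model, cap the response as well, e.g. by passing
# max_tokens=MAX_OUTPUT_TOKENS to the completion call.
```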

Three Concerns

Bias Perpetuation

The developers of the most popular LLMs (OpenAI / GPT-4, Google / Bard) have taken pains to align their models with human preferences and deploy sophisticated moderation layers. If you ask GPT-4 or Bard to tell you a racist or misogynistic joke, they will politely refuse.[7]

That’s good news. The bad news is that this moderation, which targets societal biases, doesn’t necessarily prevent institutional biases. Imagine our customer support team has a history of being rude to a particular type of customer. If historical customer support conversations are naively used to construct a new AI system (for example, via fine-tuning) that system is likely to replicate that bias.

If you’re using past data to train an AI model (be it a classical model or a generative model), closely scrutinize which past situations you want to perpetuate into the future and which you do not. Sometimes it’s easier to set principles and work from those (for example, via prompt engineering), without using past data directly.

Model Lock-In

Unless you’ve been living under a rock, you know generative AI models are advancing incredibly rapidly. Given an enterprise use case, the best LLM for it today may not be the best solution in six months and almost certainly will not be the best solution in six years. Smart ML teams know they will need to switch models at some point.

But there are two other major reasons to build for easy LLM “swapping”. First, many foundation model providers have struggled to support exponentially-growing user volume, leading to outages and degraded service. Building a fallback foundation model into your system is a good idea. Second, it can be quite useful to test multiple foundation models in your system (“a horse race”) to get a sense of which performs best. Per the section above on Evals, it’s often difficult to measure model quality analytically, so sometimes you just want to run two models and qualitatively compare the responses.
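A thin abstraction layer makes both the fallback and the horse race straightforward. In this sketch each provider is represented by a callable that wraps that provider's SDK (the wrappers themselves are hypothetical and not shown); the pattern, not the specific APIs, is the point.

```python
from typing import Callable, Dict, List

def generate_with_fallback(prompt: str, providers: List[Callable[[str], str]]) -> str:
    """Try each foundation-model wrapper in order, falling back on outages or errors."""
    last_error = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as err:  # rate limits, outages, degraded service
            last_error = err
    raise RuntimeError("all foundation models failed") from last_error

def horse_race(prompt: str, providers: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    # Run the same prompt through several models so responses can be compared qualitatively.
    return {name: call(prompt) for name, call in providers.items()}
```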

Data Leakage

Read the terms and conditions of any foundation model you’re considering using. If the model provider has the right to use user inputs for future model training, that’s worrisome. LLMs are so large it’s possible that specific user queries/responses become directly encoded in a future model version, and could then become accessible to any user of that version. Imagine a user at your organization queries “how can I clean up this code that does XYZ? [your proprietary, confidential code here]” If this query is then used by the model provider to retrain their LLM, that new version of the LLM may learn that your proprietary code is a great way to solve use case XYZ. If a competitor asks how to do XYZ, the LLM could “leak” your source code, or something very similar.

OpenAI now allows users to opt-out of their data being used to train models, which is a good precedent, but not every model provider has followed their example. Some organizations are also exploring running LLMs within their own virtual private clouds; this is a key reason for much of the interest in open-source LLMs.

Two Hot-Takes

Prompt Engineering Will Dominate Fine-Tuning

When I first started adapting LLMs for enterprise use, I was much more interested in fine-tuning than prompt engineering. Fine-tuning felt like it adhered to the principles of the classical ML systems to which I was accustomed: wrangle some data, produce a train/test dataset, kick off a training job, wait a while, evaluate the results against some metric.

But I’ve come to believe that prompt engineering (with embeddings) is a better approach for most enterprise use cases. First, the iteration cycle for prompt engineering is far faster than for fine-tuning, because there is no model training, which can take hours or days. Changing a prompt and generating new responses can be done in minutes. Conversely, fine-tuning is an irreversible process in terms of model training; if you used incorrect training data or a better base model comes out, you need to restart your fine-tuning jobs. Second, prompt engineering requires far less knowledge of ML concepts like neural network hyperparameter optimization, training job orchestration, or data wrangling. Fine-tuning often requires experienced ML engineers, while prompt engineering can often be done by software engineers without ML experience. Third, prompt engineering works better for the fast-growing strategy of model chaining, in which complex requests are decomposed into smaller, constituent requests, each of which can be assigned to a different LLM. Sometimes the best “constituent model” is a fine-tuned model.[8] But most of the value-add work for enterprises is (i) figuring out how to break apart their problem, (ii) writing the prompts for each constituent part, and (iii) identifying the best off-the-shelf model for each part; it’s not in creating their own fine-tuned models.
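A toy sketch of that chaining pattern is below. The routing rule, step list, and model names are purely illustrative; each "model" is a hypothetical callable wrapping one LLM, along the lines of the provider wrappers sketched in the model lock-in section above.

```python
# Toy sketch of model chaining: decompose a request into constituent steps and
# route each step to the model best suited for it. Everything here is illustrative.
def route(step: str) -> str:
    if any(ch.isdigit() for ch in step):
        return "arithmetic-model"  # e.g. a fine-tuned constituent model like GOAT (footnote 8)
    return "general-model"         # e.g. an off-the-shelf model like GPT-4

def run_chain(steps, models):
    # models maps a route name to a callable that wraps one LLM.
    return [models[route(step)](step) for step in steps]

steps = [
    "Summarize the customer's complaint in one sentence.",
    "Compute the refund owed: 3 items at $19.99 each minus a $10 credit.",
]
# results = run_chain(steps, {"general-model": call_gpt4, "arithmetic-model": call_goat})
# (call_gpt4 and call_goat are hypothetical provider wrappers.)
```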

The advantages of prompt engineering are likely to widen over time. Today, prompt engineering requires long, expensive prompts (since context must be included in each prompt). But I’d bet on rapidly declining cost per token, as the model provider space gets more competitive and providers figure out how to train LLMs more cheaply. Prompt engineering is also limited today by maximum prompt sizes — but OpenAI already accepts 32K tokens (~40 pages of average English text) per prompt for GPT-4, and Anthropic’s Claude accepts 100K tokens (~150 pages). And I’d bet on even larger context windows coming out in the near future.

Data Won’t Be The Moat It Once Was

As LLMs have become better at producing human-interpretable reasoning, it’s useful to consider how humans use data to reason, and what that implies for LLMs.[9] Humans don’t actually use much data! Most of the time, we do “zero-shot learning”, which simply means we answer questions without the question being accompanied by a set of example question-answer pairs. The questioner just provides the question, and we answer based on logic, principles, heuristics, biases, etc.

This is different from the LLMs of just a few years ago, which were only good at few-shot learning, where you needed to include a handful of example question-answer pairs in your prompt. And it’s very different from classical ML, where the model needed to be trained on hundreds, thousands, or millions of question-answer pairs.

I strongly believe that an increasing, dominant share of LLM use cases will be “zero-shot”. LLMs will be able to answer most questions without any user-provided examples. They will need prompt engineering, in the form of instructions, policies, assumptions, etc. For example, this post uses GPT-4 to review code for security vulnerabilities; the approach requires no data on past instances of vulnerable code. Having clear instructions, policies, and assumptions will become increasingly important — but having large volumes of high-quality, labeled, proprietary data will become less important.
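A sketch of what such a zero-shot, instruction-driven prompt might look like is below. The policies and assumptions in it are my own illustrative examples, not those from the post referenced above, and the call again assumes the pre-1.0 `openai` SDK.

```python
import openai  # pre-1.0 openai SDK assumed; the review policies below are illustrative

ZERO_SHOT_REVIEW = """You are a security reviewer for a public-facing web service.
Policies:
- Flag SQL built by string concatenation as a potential injection risk.
- Flag hard-coded secrets or API keys.
List any vulnerabilities in the code below and explain each briefly."""

def review_code(code: str) -> str:
    # Zero-shot: instructions, policies, and assumptions only; no example vulnerable code.
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": f"{ZERO_SHOT_REVIEW}\n\n{code}"}],
    )
    return resp["choices"][0]["message"]["content"]
```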

If you’re actively working on applying LLMs to your enterprise data, I’d love to hear about what you’ve found works and what does not. Please leave a comment!

Footnotes

[1] Until recently, LLMs were also unaware of recent public knowledge — for example GPT-4 was trained on information collected through Sept 2021. However, the consumer interfaces for GPT-4 and Bard are now able to query the open internet and collect information about recent events. So recency is quickly fading as a knowledge limitation for LLMs.

[2] Embeddings can work on all kinds of data structures, not just text.

[3] The entire embedding workflow occurs prior to calling the LLM. For example, OpenAI recommends using its ada-002 model for embeddings, which is cheaper and faster than any of the leading-edge GPT models.

[4] Tokens are words or word parts. Here’s a good explanation of why language models use tokens, not words.

[5] Learned parameter count could be anywhere from millions to trillions. Most widely-used LLMs today have billions.

[6] Cheaper inferences are not a given; OpenAI charges $0.03–0.06 per 1k tokens for GPT-4 with an 8K context window (depending on whether the tokens are inputs or outputs, respectively). It charges $0.12 per 1k tokens for a fine-tuned version of Davinci, a lagging-edge model.

[7] Of course, these are humans employed by OpenAI and Google. And since a lot of people disagree with those organizations’ values, they disagree with the moderation policies.

[8] For example, GOAT is a version of the open-source model LLaMA fine-tuned for arithmetic. It outperforms GPT-4 on many arithmetic benchmarks. Most enterprises have workflows that require arithmetic; under the chaining approach, the parts of the workflow that involve arithmetic would be identified and routed to GOAT. It makes sense for such an enterprise to invest heavily in good routing and integration with GOAT, but, in my opinion, not to fine-tune their own arithmetic LLM.

[9] There is much debate about whether today’s LLMs can actually reason, and what actual reasoning even means (does it require consciousness? self-awareness? agency?). Sidestepping that debate, which is more philosophical than empirical, it’s worth noting that LLMs are unequivocally getting better at producing explanations that conform to widely-held notions of good reasoning; this paper has many great examples.

Machine Learning · AI · Product Management · Data Science
