QTM 350 - Data Science Computing

Lecture 11 - Introduction to AI-Assisted Programming

Danilo Freire

Emory University

07 October, 2024

I hope you’re having a lovely day! 😊

Some comments on Quiz 2 📚

A few notes on Quiz 2

  • I haven’t graded Quiz 2 yet, but I have seen most of your projects…
  • … and I’m happy with them! 😊
  • Many of you wrote and published the website successfully, which is great!
  • You indeed used different themes and styles, and the analyses were correct
  • However, I noticed a few common mistakes that I’d like to address 😉
  • Most of the mistakes were actually not related to this course’s content at all!
  • They were mainly about something that is very important in computing in general…
    • Folder structure!
  • Other common mistakes were:
    • Small typos in the code (e.g., missing a comma, forgetting to close the code chunk with ```, etc.)
    • Issues with pushing the changes to GitHub
  • Feedback will soon be available on Canvas 😉

Some tips on folder and project management

  • macOS organises files and applications in a hierarchical folder structure
  • Key system folders: Applications, Library, System, Users
  • The Users folder contains individual home folders for each user
  • Your home folder (e.g., /Users/yourusername/) contains personal folders like Documents, Desktop, and Downloads
  • Access folders via the Terminal:
    • Open Terminal (Applications > Utilities > Terminal)
    • Use the cd command to navigate: cd ~/Documents
    • List contents with the ls command
    • View your current location with the pwd command
  • Common shortcuts:
    • ~ represents your home folder
    • / represents the root directory

Some tips on folder and project management

  • It is a good idea to create a single github folder to download and manage all your Emory projects
  • You should put the folder in your home directory or in Documents
  • Inside the github folder, create a folder for each course and separate folders for each project or quiz
  • For this course, you could have a structure like this:
  • Documents
    • github
      • qtm350 (our shared repository)
      • qtm350-quiz01 (forked repository)
      • qtm350-quiz02 (forked repository)
  • If you made a mistake and would like to start over, you can always delete the folder and clone the repository again
    • That’s one of the nicest things about Git and GitHub! 😊

Today’s lecture 🤖

Introduction to AI-Assisted Programming

  • Yes, we all love ChatGPT! 🤖
  • And LLMs (Large Language Models) are indeed changing the way we code
  • This new paradigm is called AI-Assisted Programming
  • It is not about replacing programmers, but about making them more productive (at least for now 😅)
  • In this lecture, we will discuss the main concepts behind AI-Assisted Programming
    • What LLMs are
    • How they can help us write code
    • Limits and challenges
    • How to get started with GitHub Copilot

What are LLMs?

  • LLMs are a type of neural network based on the Transformer architecture (that’s the T in GPT - Generative Pre-trained Transformer)
  • Many important ideas behind neural networks were developed in the 1950s and 1960s (!), but the area has recently exploded due to the availability of large datasets and powerful GPUs
  • LLMs are trained on large corpora of text data (e.g., books, articles, websites, etc.), and they learn to predict the next word in a sentence
  • For code, they are trained on large repositories like GitHub and use Natural Language Processing (NLP) to understand the context and generate code snippets
  • This means they are particularly good at writing Python or JavaScript, but they can also help with other languages
  • For a very good introduction to LLMs, I strongly recommend this article by Stephen Wolfram

What are LLMs?

  • So far, LLMs have made tremendous progress in many areas in a very short time
  • They can generate text, code, music, art, and even new scientific discoveries
  • As we all know, what is remarkable about LLMs is not only the quality of their output, but also the fact that they can be used by anyone, including those with no background in AI or programming
  • Thus, LLMs are democratising programming like never before
  • Remember when we talked about low and high level programming languages? LLMs are taking us to a new level of abstraction
  • LLMs are the culmination of a long process of abstraction in computing, and they are changing the way we think about programming

Source: Taulli (2024).

“Delving” into LLMs 🧠

How do LLMs work?

A very simplified explanation: Sorry mathematicians! 😅

  • The data collected by AI firms are huge and include many languages and styles
  • An overlooked part of the training process is the data cleaning and preprocessing: removing duplicates, correcting errors, and standardising formats
  • Tokenisation is the process of breaking text into smaller units called tokens, and each token is assigned a unique ID
    • For example, “Chatbots are helpful in 2024!” = ["Chat", "bots", "are", "help", "ful", "in", "2024", "!"]
  • The main objective during training is for the model to predict the next token in a sequence based on the tokens that come before it
  • This allows the model to learn language patterns without needing labelled data
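  • To make this concrete, here is a toy sketch in Python: it splits a tiny corpus into tokens and “predicts” the next token by counting which token most often follows the current one. Real LLMs learn these probabilities with billions of parameters rather than simple counts, so treat this purely as an illustration of the training objective

```python
from collections import Counter, defaultdict

# Toy corpus, already split into "tokens"; real tokenisers use subword units and IDs
corpus = "chat bots are helpful . chat bots are here to stay .".split()

# Count how often each token follows the previous one (a simple bigram model)
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

# "Predict" the next token after "bots" by picking its most frequent follower
print(follows["bots"].most_common(1))  # [('are', 2)]
```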

How do LLMs work?

  • Models use backpropagation to update their parameters based on the errors in their predictions
    • Backpropagation computes the gradients that gradient descent then uses to adjust the model’s weights and minimise the loss function (a toy sketch of this update follows this list)
  • They also use several methods to prevent overfitting, such as dropout
    • Dropout refers to randomly “dropping out”, or omitting, units (both hidden and visible) during the training process of a neural network
  • After initial training, the model can be fine-tuned on specific datasets to improve its performance in particular domains, such as programming
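  • As a rough illustration of what “adjusting weights to minimise a loss” means, here is gradient descent on a one-parameter toy problem; backpropagation is the procedure that computes these gradients efficiently through many layers, and a real LLM applies the same kind of update to billions of parameters

```python
import numpy as np

# Toy problem: recover the true weight 3.0 in y = w * x by minimising the MSE loss
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

w = 0.0                                  # start from an arbitrary parameter value
learning_rate = 0.1
for step in range(100):
    y_hat = w * x                        # forward pass: current predictions
    grad = np.mean(2 * (y_hat - y) * x)  # gradient of the MSE loss w.r.t. w
    w -= learning_rate * grad            # gradient descent update
print(round(w, 2))                       # ≈ 3.0
```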

Why was GPT-3 so special?

  • Transformers changed the field by allowing models to process text in parallel, which made them much faster than previous models
  • The model consists of multiple layers, each performing specific functions to process and generate text
  • GPT-3 had 175 billion parameters, which made it the largest model at the time
  • Another important feature was its zero-shot learning capability, which allowed it to perform tasks without any task-specific examples or fine-tuning
  • GPT also uses multi-head attention to focus on different parts of the input text at the same time (a minimal sketch of a single attention head follows this list)
    • For example, it can focus on the subject and the verb of a sentence simultaneously
  • Finally, the model also scales well with more data and more parameters
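  • For the curious, below is a minimal NumPy sketch of the scaled dot-product attention computed by a single attention head; multi-head attention runs several of these in parallel with different learned projections. The random inputs are only there to make the example runnable

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mix of the values

# Self-attention over 4 tokens with an embedding dimension of 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(X, X, X).shape)  # (4, 8)
```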

AI-Assisted Programming: Benefits

  • The area is still very new, but we can already see some important benefits of AI-Assisted Programming:
  • Minimising search time: According to the 2022 Stack Overflow Developer Survey, 62% of the developers spent more than 30 minutes a day searching for answers, and 25% spent over an hour a day. Users of GitHub Copilot report that they finish tasks 55% faster
  • A 24/7 coding advisor: You can ask questions and get code snippets at any time, and you can also use it to learn new languages. Results are promising
  • Easy IDE integration: You can use AI-Assisted Programming in your favourite IDE, and it will help you with code completion, refactoring, and debugging. GitHub Copilot is available for Visual Studio Code, PyCharm, vim, etc
  • Reflecting your codebase and workspace: New tools allow you to train LLMs on your own codebase and search for files and functions in your workspace, which is a huge benefit for newcomers to a team
  • Assessing code integrity: LLMs can identify bugs, vulnerabilities, run tests, and make suggestions
  • Language translation: You can write code in your language and get it translated to another language. For example, IBM’s Watsonx.ai model understands 115 coding languages based on 1.5 trillion tokens

AI-Assisted Programming: Challenges

Hallucination

  • Generative AI models can produce incorrect or misleading content
  • This can be due to errors in the model, biases or incorrect information in the training data, or the limitations of the model architecture
  • This makes it vital to check the output of these models and not take it at face value. For example, I asked Copilot (which is powered by OpenAI’s GPT-4) to solve a simple quadratic equation, and it very confidently gave me a very wrong answer 😅
  • It provided the answers 1/2 and -5/4 when the correct answers were 0.804 and -1.55
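  • Checking the model’s maths usually takes only a couple of lines of code. The coefficients below are an assumption chosen to reproduce the correct roots quoted above, not necessarily the exact equation I used; the point is the verification step

```python
import math

# Hypothetical coefficients of a*x^2 + b*x + c = 0, chosen so the roots
# match the correct answers above (0.804 and -1.55); an illustration only
a, b, c = 1.0, 0.75, -1.25
disc = b**2 - 4 * a * c
roots = [(-b + math.sqrt(disc)) / (2 * a), (-b - math.sqrt(disc)) / (2 * a)]
print([round(r, 3) for r in roots])      # [0.804, -1.554]

# Plugging each root back into the equation should give (roughly) zero
print([abs(a * r**2 + b * r + c) < 1e-9 for r in roots])  # [True, True]
```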

AI-Assisted Programming: Challenges

Bias

  • AI models can amplify biases present in the training data
  • For instance, I asked an AI to give me the names of 10 famous scientists, and it came up with the following list:
    • Albert Einstein
    • Isaac Newton
    • Marie Curie
    • Charles Darwin
    • Nikola Tesla
    • Galileo Galilei
    • Stephen Hawking
    • Leonardo da Vinci
    • Thomas Edison
    • Ada Lovelace
  • Can you spot the bias?

AI-Assisted Programming: Challenges

Bias

https://www.nature.com/articles/s41598-023-42384-8

AI-Assisted Programming: Challenges

Easy to confuse

  • AI models can be easily confused by small changes in the input
  • AIs have an annoying tendency to be overpolite and agree with you even when you are wrong
  • Thus, they can be easily manipulated (including by bad actors)
  • For example, I asked Copilot what the westernmost point of Europe was and then contradicted its answer, and it readily went along with me

AI-Assisted Programming: Challenges

Intellectual property

  • AI models can generate code that is identical to existing code
  • This raises questions about intellectual property and the ownership of the code
  • This is an ongoing debate in the AI community, and it is not clear how it will be resolved
  • But it does have some important implications for open source software and the sharing of code

AI-Assisted Programming: Challenges

Security

  • AI models can be vulnerable to adversarial attacks and injection attacks
  • As programmers use more AI-generated code, many blindly trust the output of these models and send the code into production
  • In Security Weaknesses of Copilot Generated Code in GitHub, Yujia Fu et al. highlighted the security issues with GitHub Copilot.
  • They evaluated 435 AI-generated code snippets from projects on GitHub, and 35.8% had security vulnerabilities
  • …and much of this code ends up back in the training data for future LLMs! 😅

Can we prevent these issues? 🤔

How to prevent some of these issues

Better prompt engineering

  • Prompt engineering is a new buzzword in the AI community
  • It refers to the process of designing the input to an AI model to get the desired output
  • It is, to a large extent, a mix of art and science that involves a lot of trial and error (at least in my experience!)
  • The idea is to guide the model to produce the desired output by providing it with the right context and examples
  • Prompt engineering will never be fully exact simply because models themselves are probabilistic
  • But you can think of a prompt as having four main components:

Source: Taulli (2024).

  • Context: The information you provide to the model
  • Instructions: The task you want the model to perform
  • Input of the model: The data you feed into the model
  • Format: The structure of the prompt
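Before we look at each component in turn, here is a minimal sketch of how you might assemble the four pieces into a single prompt; the wording and variable names are purely illustrative:

```python
# Assemble the four prompt components into a single prompt string
context = "You are an experienced data analyst who writes clean pandas code."
instructions = "Write a function that removes duplicate rows and reports how many were dropped."
model_input = "The DataFrame has the columns customer_id, purchase_date, and amount."
output_format = "Answer with a single Python code block, including a short docstring."

prompt = "\n\n".join([context, instructions, model_input, output_format])
print(prompt)
```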

Context

  • It is a good idea to begin your prompt with a sentence or two that clearly defines the task you want the model to perform
  • Creating a persona for the model can also help guide its output, as in the prompt below (one possible answer is sketched after this list)
    • Prompt: You are an experienced software engineer specialising in debugging Python code. You are asked to write a function that takes a list of integers and returns the sum of the even numbers.
  • Personal note: I’ve had good results by adopting multiple personas in the same prompt
    • For example, I might ask the model to write a function as if it were a beginner, an intermediate, and an expert programmer and then compare the results
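  • For reference, one plausible answer to the persona prompt above looks like the snippet below; Copilot’s actual suggestion may differ in style

```python
def sum_even_numbers(numbers: list[int]) -> int:
    """Return the sum of the even numbers in a list of integers."""
    return sum(n for n in numbers if n % 2 == 0)

print(sum_even_numbers([1, 2, 3, 4, 5, 6]))  # 12
```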

Instructions

  • Your prompt should include at least one clear and concise instruction
  • Fewer instructions are better because they reduce the chances of the model getting confused
  • You can also use examples to guide the model
    • Prompt: Develop a SQL query to retrieve from our database a list of customers who made purchases above $500 in the last quarter of 2023. The query should return the customer’s full name, their email address, the total amount spent, and the date of their last purchase. The results should be sorted by the total amount spent in descending order. Please ensure that the query is optimized for performance.
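  • For reference, here is one query this prompt could plausibly produce, wrapped in a Python string; the table and column names (customers, purchases, and so on) are assumptions, since the prompt does not specify the schema

```python
# Hypothetical schema: customers(id, full_name, email) and
# purchases(customer_id, amount, purchase_date)
query = """
SELECT c.full_name,
       c.email,
       SUM(p.amount)        AS total_spent,
       MAX(p.purchase_date) AS last_purchase
FROM customers AS c
JOIN purchases AS p ON p.customer_id = c.id
WHERE p.purchase_date BETWEEN '2023-10-01' AND '2023-12-31'
GROUP BY c.id, c.full_name, c.email
HAVING SUM(p.amount) > 500
ORDER BY total_spent DESC;
"""
```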

Input of the model

  • The input of the model is the data you feed into it
  • Use keywords that are relevant to the task you want the model to perform
  • When it comes to coding, you can provide the model with code snippets that it can use as a reference
  • Also pay attention to delimiters and separators, as they can help the model understand the structure of the input
  • For example, you can use comments to provide additional information to the model
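  • As a small illustration, the sketch below uses made-up delimiter tags and a comment to structure the input; the function and the tags are purely hypothetical

```python
# Hypothetical reference snippet; the <code> ... </code> tags act as delimiters
# so the model can tell where the snippet begins and ends
reference_code = "def load_data(path):\n    return open(path).read()\n"

prompt = (
    "Refactor the function between the <code> tags so it handles missing files gracefully.\n"
    "<code>\n" + reference_code + "</code>\n"
    "# Keep the function name and the return type unchanged."
)
print(prompt)
```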

Format

  • Finally, you can also be explicit about the format you want the output to be in
  • For example, you can specify that you want the output to be in a specific programming language, or you can ask the model to provide you with a code snippet
  • You can also ask the model to provide you with a list of steps to complete a task
  • This can be useful if you are working on a complex project or with different programming languages

How to get started with GitHub Copilot 🚀

How to get started with GitHub Copilot

  • We will look into Copilot in more detail in the next lecture, but I just want to make sure we all have it installed and ready to go 😉
  • First, it is a good idea to have VS Code installed on your computer
  • Then, you can install the GitHub Copilot extension from the VS Code marketplace
  • You will need to sign in with your GitHub account to use Copilot
  • Copilot works in the CLI as well, and you can install it using this guide. You need to install GitHub CLI first
  • Take some time to install and play around with Copilot before the next lecture 😃
  • Please let me know if you have any questions or issues!

Basic components

  • There are two main components to GitHub Copilot:
    • GitHub Copilot: This provides code suggestions and completions in the editor window
    • GitHub Copilot Chat: This provides a chat interface to Copilot, allowing you to ask questions and get code suggestions. It can also interact with code in the editor window. This is by far the most powerful feature of Copilot
  • Copilot not only offers code suggestions, but it can also help you with writing documentation, providing explanations for code you don’t know, and, more recently, retrieving code snippets and information from your whole workspace
  • In my view, it is the most convenient AI assistant for programming available today
  • It is also very easy to use

Autocompletion

  • If you start writing code in the editor, Copilot will suggest completions
  • You can accept these completions by pressing Tab or Enter
  • Alternatively, you can press Ctrl and the right arrow key to accept just the next word of the suggestion
  • This can be helpful if some, but not all, of the suggestion is appropriate
  • This autocompletion will work in a number of different contexts, including code cells in Jupyter Notebooks, or within a .py file.
  • The autocompletion is aware of the code around your cursor, and will suggest completions based on this context
  • It can use functions already in the code, and can infer a likely purpose of the next piece of code based on the code that has already been written
  • Sometimes, writing a comment will help guide Copilot to suggest relevant code to you
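  • For example, typing a descriptive comment like the one below often nudges Copilot towards a matching suggestion; the completion shown here is only illustrative, as suggestions vary

```python
import statistics

# Compute the mean and standard deviation of a list of numbers
def summary_stats(values):
    return statistics.mean(values), statistics.stdev(values)

print(summary_stats([2, 4, 4, 4, 5, 5, 7, 9]))  # (5, ~2.14)
```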

In-Editor Prompting

  • In an empty file or code cell, VS Code will display a greyed-out hint showing how to invoke Copilot

  • If you press Ctrl + I, a text box will appear into which you can write a prompt
  • Copilot will suggest new code, or changes to existing code based on this prompt
  • These may be in the form of one or more proposed changes
  • You may choose to accept or discard each of these changes by clicking the Accept or Discard buttons

In-Editor Prompting

  • You can also highlight an existing piece of code and press Ctrl + I to ask Copilot to suggest changes to that code
  • This can be useful if you have a piece of code that you think could be improved, or if you want to see alternative ways of writing the same code
  • When you do this, you can also click the button next to Accept and Discard to see which code Copilot suggests removing for each change

In-Editor Prompting

  • There are a few standard commands that are available for prompting Copilot to do something when you have a piece of code selected

  • You can access these in the Ctrl + I interface by typing a forward slash, then the name of the command. These are:

  • /doc: This will ask Copilot to generate documentation for the selected code. This will suggest changes in the editor

  • /explain: This will ask Copilot to explain the selected code. It will do this in the Copilot Chat extension

  • /fix: This will look for problems in the selected code and suggest fixes for them in the editor

  • /test: This will ask Copilot to generate tests for the selected code. This will suggest changes in the editor, which may include creating a new file for the tests

  • You can also find these options by right-clicking a highlighted piece of code, and going into the Copilot menu

  • We’ll look at using some of these tools later in the course

Copilot Chat

  • You can also chat to Copilot in a manner closer to that of a chatbot like ChatGPT
  • To do this, you can open the chat window by clicking on the chat icon in the activity bar at the left of the screen

  • You can type a message to Copilot in the window and it will respond, including suggesting code snippets where relevant
  • Copilot chat is limited to discussing programming
  • If Copilot produces a code snippet that you want to use, you can hover over the snippet and:
    • click the “Copy” button to copy it to the clipboard
    • click the “Insert at Cursor” button to insert it at the current cursor position in the editor
    • insert it into a new file or into the currently active terminal

Exercise

You can work on this exercise on your own time

  • Experiment with some of the ways of using Copilot that we have discussed in this lecture. Include the following activities:
    • Use the Ctrl + I interface to ask Copilot to generate a function from scratch.
    • Use the Ctrl + I interface to ask Copilot a question about an existing piece of code.
    • Use autocomplete to complete a function that you have started writing.
    • Ask Copilot Chat a general question about programming.
    • Ask Copilot Chat a question about a specific piece of code.
    • Insert some code produced by Copilot Chat into the code cell.

That’s all for today! 🎉

Next time we will learn what Copilot can do for us! 🤖

Have a great day! 😊