QTM 350 - Data Science Computing

Lecture 11 - Introduction to AI-Assisted Programming

Danilo Freire

Emory University

07 October, 2024

I hope you’re having a lovely day! 😊

Some comments on Quiz 2 📚

A few notes on Quiz 2

  • I haven’t graded Quiz 2 yet, but I have seen most of your projects…
  • … and I’m happy with them! 😊
  • Many of you wrote and published the website successfully, which is great!
  • You indeed used different themes and styles, and the analyses were correct
  • However, I noticed a few common mistakes that I’d like to address 😉
  • Most of the mistakes were actually not related to this course’s content at all!
  • They were mainly about something that is very important in computing in general…
    • Folder structure!
  • Other common mistakes were:
    • Small typos in the code (e.g., missing a comma, forgetting to close the code chunk with ```, etc.)
    • Issues with pushing the changes to GitHub
  • Feedback will soon be available on Canvas 😉

Some tips on folder and project management

  • macOS organises files and applications in a hierarchical folder structure
  • Key system folders: Applications, Library, System, Users
  • The Users folder contains individual home folders for each user
  • Your home folder (e.g., /Users/yourusername/) contains personal folders like Documents, Desktop, and Downloads
  • Access folders via the Terminal:
    • Open Terminal (Applications > Utilities > Terminal)
    • Use the cd command to navigate: cd ~/Documents
    • List contents with the ls command
    • View your current location with the pwd command
  • Common shortcuts:
    • ~ represents your home folder
    • / represents the root directory

Some tips on folder and project management

  • It is a good idea to create a single github folder to download and manage all your Emory projects
  • You should put the folder in your home directory or in Documents
  • Inside the github folder, create a folder for each course and separate folders for each project or quiz
  • For this course, you could have a structure like this:
  • Documents
    • github
      • qtm350 (our shared repository)
      • qtm350-quiz01 (forked repository)
      • qtm350-quiz02 (forked repository)
  • If you made a mistake and would like to start over, you can always delete the folder and clone the repository again
    • That’s one of the nicest things about Git and GitHub! 😊

Today’s lecture 🤖

Introduction to AI-Assisted Programming

  • Yes, we all love ChatGPT! 🤖
  • And LLMs (Large Language Models) are indeed changing the way we code
  • This new paradigm is called AI-Assisted Programming
  • It is not about replacing programmers, but about making them more productive (at least for now 😅)
  • In this lecture, we will discuss the main concepts behind AI-Assisted Programming
    • What LLMs are
    • How they can help us write code
    • Limits and challenges
    • How to get started with GitHub Copilot

What are LLMs?

  • LLMs are a type of neural network based on the Transformer architecture (that’s the T in GPT - Generative Pre-trained Transformer)
  • Many important ideas behind neural networks were developed in the 1950s and 1960s (!), but the area has recently exploded due to the availability of large datasets and powerful GPUs
  • LLMs are trained on large corpora of text data (e.g., books, articles, websites, etc.), and they learn to predict the next word in a sentence
  • For code, they are trained on large repositories like GitHub and use Natural Language Processing (NLP) to understand the context and generate code snippets
  • This means they are particularly good at writing Python or JavaScript, but they can also help with other languages
  • For a very good introduction to LLMs, I strongly recommend this article by Stephen Wolfram

What are LLMs?

  • So far, LLMs have made tremendous progress in many areas in a very short time
  • They can generate text, code, music, art, and even new scientific discoveries
  • As we all know, what is remarkable about LLMs is not only the quality of their output, but also the fact that they can be used by anyone, including those with no background in AI or programming
  • Thus, LLMs are democratising programming like never before
  • Remember when we talked about low and high level programming languages? LLMs are taking us to a new level of abstraction
  • LLMs are the culmination of a long process of abstraction in computing, and they are changing the way we think about programming

Source: Taulli (2024).

“Delving” into LLMs 🧠

How do LLMs work?

A very simplified explanation: Sorry mathematicians! 😅

  • The data collected by AI firms are huge and include many languages and styles
  • An overlooked part of the training process is the data cleaning and preprocessing: removing duplicates, correcting errors, and standardising formats
  • Tokenisation is the process of breaking text into smaller units called tokens, and each token is assigned a unique ID
    • For example, “Chatbots are helpful in 2024!” = ["Chat", "bots", "are", "help", "ful", "in", "2024", "!"]
  • The main objective during training is for the model to predict the next token in a sequence based on the tokens that come before it
  • This allows the model to learn language patterns without needing labelled data
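  • To make this concrete, here is a toy sketch in Python: it splits a tiny corpus into tokens and “predicts” the next token by counting which token most often follows the current one. Real LLMs learn these probabilities with billions of parameters rather than simple counts, so treat this purely as an illustration of the training objective

```python
from collections import Counter, defaultdict

# Toy corpus, already split into "tokens"; real tokenisers use subword units and IDs
corpus = "chat bots are helpful . chat bots are here to stay .".split()

# Count how often each token follows the previous one (a simple bigram model)
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

# "Predict" the next token after "bots" by picking its most frequent follower
print(follows["bots"].most_common(1))  # [('are', 2)]
```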

How do LLMs work?

  • Models use backpropagation to update their parameters based on the errors in their predictions
    • Backpropagation computes the gradients that gradient descent then uses to adjust the model’s weights and minimise the loss function (a toy sketch of this update follows this list)
  • They also use several methods to prevent overfitting, such as dropout
    • Dropout refers to randomly “dropping out”, or omitting, units (both hidden and visible) during the training process of a neural network
  • After initial training, the model can be fine-tuned on specific datasets to improve its performance in particular domains, such as programming
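  • As a rough illustration of what “adjusting weights to minimise a loss” means, here is gradient descent on a one-parameter toy problem; backpropagation is the procedure that computes these gradients efficiently through many layers, and a real LLM applies the same kind of update to billions of parameters

```python
import numpy as np

# Toy problem: recover the true weight 3.0 in y = w * x by minimising the MSE loss
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

w = 0.0                                  # start from an arbitrary parameter value
learning_rate = 0.1
for step in range(100):
    y_hat = w * x                        # forward pass: current predictions
    grad = np.mean(2 * (y_hat - y) * x)  # gradient of the MSE loss w.r.t. w
    w -= learning_rate * grad            # gradient descent update
print(round(w, 2))                       # ≈ 3.0
```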

Why was GPT-3 so special?

  • Transformers changed the field by allowing models to process text in parallel, which made them much faster than previous models
  • The model consists of multiple layers, each performing specific functions to process and generate text
  • GPT-3 had 175 billion parameters, which made it the largest model at the time
  • Another important feature was its zero-shot learning capability, which allowed it to perform tasks without any task-specific examples or fine-tuning
  • GPT also uses multi-head attention to focus on different parts of the input text at the same time (a minimal sketch of a single attention head follows this list)
    • For example, it can focus on the subject and the verb of a sentence simultaneously
  • Finally, the model also scales well with more data and more parameters
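  • For the curious, below is a minimal NumPy sketch of the scaled dot-product attention computed by a single attention head; multi-head attention runs several of these in parallel with different learned projections. The random inputs are only there to make the example runnable

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mix of the values

# Self-attention over 4 tokens with an embedding dimension of 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(X, X, X).shape)  # (4, 8)
```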

AI-Assisted Programming: Benefits

  • The area is still very new, but we can already see some important benefits of AI-Assisted Programming:
  • Minimising search time: According to the 2022 Stack Overflow Developer Survey, 62% of the developers spent more than 30 minutes a day searching for answers, and 25% spent over an hour a day. Users of GitHub Copilot report that they finish tasks 55% faster
  • A 24/7 coding advisor: You can ask questions and get code snippets at any time, and you can also use it to learn new languages. Results are promising
  • Easy IDE integration: You can use AI-Assisted Programming in your favourite IDE, and it will help you with code completion, refactoring, and debugging. GitHub Copilot is available for Visual Studio Code, PyCharm, vim, etc
  • Reflecting your codebase and workspace: New tools allow you to train LLMs on your own codebase and search for files and functions in your workspace, which is a huge benefit for newcomers to a team
  • Assessing code integrity: LLMs can identify bugs, vulnerabilities, run tests, and make suggestions
  • Language translation: You can write code in your language and get it translated to another language. For example, IBM’s Watsonx.ai model understands 115 coding languages based on 1.5 trillion tokens

AI-Assisted Programming: Challenges

Hallucination

  • Generative AI models can produce incorrect or misleading content
  • This can be due to errors in the model, biases or incorrect information in the training data, or the limitations of the model architecture
  • This makes it vital to check the output of these models and not take it at face value. For example, I asked Copilot (which is powered by OpenAI’s GPT-4) to solve a simple quadratic equation, and it very confidently gave me a very wrong answer 😅
  • It provided the answers 1/2 and -5/4 when the correct answers were 0.804 and -1.55
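  • Checking the model’s maths usually takes only a couple of lines of code. The coefficients below are an assumption chosen to reproduce the correct roots quoted above, not necessarily the exact equation I used; the point is the verification step

```python
import math

# Hypothetical coefficients of a*x^2 + b*x + c = 0, chosen so the roots
# match the correct answers above (0.804 and -1.55); an illustration only
a, b, c = 1.0, 0.75, -1.25
disc = b**2 - 4 * a * c
roots = [(-b + math.sqrt(disc)) / (2 * a), (-b - math.sqrt(disc)) / (2 * a)]
print([round(r, 3) for r in roots])      # [0.804, -1.554]

# Plugging each root back into the equation should give (roughly) zero
print([abs(a * r**2 + b * r + c) < 1e-9 for r in roots])  # [True, True]
```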

AI-Assisted Programming: Challenges

Bias

  • AI models can amplify biases present in the training data
  • For instance, I asked an AI to give me the names of 10 famous scientists, and it came up with the following list:
    • Albert Einstein
    • Isaac Newton
    • Marie Curie
    • Charles Darwin
    • Nikola Tesla
    • Galileo Galilei
    • Stephen Hawking
    • Leonardo da Vinci
    • Thomas Edison
    • Ada Lovelace
  • Can you spot the bias?

AI-Assisted Programming: Challenges

Bias

https://www.nature.com/articles/s41598-023-42384-8

AI-Assisted Programming: Challenges

Easy to confuse

  • AI models can be easily confused by small changes in the input
  • AIs have an annoying tendency to be overpolite and agree with you even when you are wrong
  • Thus, they can be easily manipulated (including by bad actors)
  • For example, I asked Copilot what the westernmost point of Europe was and then contradicted its answer, and it readily went along with me

AI-Assisted Programming: Challenges

Intellectual property

  • AI models can generate code that is identical to existing code
  • This raises questions about intellectual property and the ownership of the code
  • This is an ongoing debate in the AI community, and it is not clear how it will be resolved
  • But it does have some important implications for open source software and the sharing of code

AI-Assisted Programming: Challenges

Security

  • AI models can be vulnerable to adversarial attacks and injection attacks
  • As programmers use more AI-generated code, many blindly trust the output of these models and send the code into production
  • In Security Weaknesses of Copilot Generated Code in GitHub, Yujia Fu et al. highlighted the security issues with GitHub Copilot.
  • They evaluated 435 AI-generated code snippets from projects on GitHub, and 35.8% had security vulnerabilities
  • …and much of this code ends up back in the training data for future LLMs! 😅

Can we prevent these issues? 🤔

How to prevent some of these issues

Better prompt engineering

  • Prompt engineering is a new buzzword in the AI community
  • It refers to the process of designing the input to an AI model to get the desired output
  • It is, to a large extent, a mix of art and science that involves a lot of trial and error (at least in my experience!)
  • The idea is to guide the model to produce the desired output by providing it with the right context and examples
  • Prompt engineering will never be fully exact simply because models themselves are probabilistic
  • But you can think of a prompt as having four main components:

Source: Taulli (2024).

  • Context: The information you provide to the model
  • Instructions: The task you want the model to perform
  • Input of the model: The data you feed into the model
  • Format: The structure of the prompt
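Before we look at each component in turn, here is a minimal sketch of how you might assemble the four pieces into a single prompt; the wording and variable names are purely illustrative:

```python
# Assemble the four prompt components into a single prompt string
context = "You are an experienced data analyst who writes clean pandas code."
instructions = "Write a function that removes duplicate rows and reports how many were dropped."
model_input = "The DataFrame has the columns customer_id, purchase_date, and amount."
output_format = "Answer with a single Python code block, including a short docstring."

prompt = "\n\n".join([context, instructions, model_input, output_format])
print(prompt)
```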

Context

  • It is a good idea to begin your prompt with a sentence or two that clearly defines the task you want the model to perform
  • Creating a persona for the model can also help guide its output, as in the prompt below (one possible answer is sketched after this list)
    • Prompt: You are an experienced software engineer specialising in debugging Python code. You are asked to write a function that takes a list of integers and returns the sum of the even numbers.
  • Personal note: I’ve had good results by adopting multiple personas in the same prompt
    • For example, I might ask the model to write a function as if it were a beginner, an intermediate, and an expert programmer and then compare the results
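  • For reference, one plausible answer to the persona prompt above looks like the snippet below; Copilot’s actual suggestion may differ in style

```python
def sum_even_numbers(numbers: list[int]) -> int:
    """Return the sum of the even numbers in a list of integers."""
    return sum(n for n in numbers if n % 2 == 0)

print(sum_even_numbers([1, 2, 3, 4, 5, 6]))  # 12
```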

Instructions

  • Your prompt should include at least one clear and concise instruction
  • Fewer instructions are better because they reduce the chances of the model getting confused
  • You can also use examples to guide the model
    • Prompt: Develop a SQL query to retrieve from our database a list of customers who made purchases above $500 in the last quarter of 2023. The query should return the customer’s full name, their email address, the total amount spent, and the date of their last purchase. The results should be sorted by the total amount spent in descending order. Please ensure that the query is optimized for performance.
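  • For reference, here is one query this prompt could plausibly produce, wrapped in a Python string; the table and column names (customers, purchases, and so on) are assumptions, since the prompt does not specify the schema

```python
# Hypothetical schema: customers(id, full_name, email) and
# purchases(customer_id, amount, purchase_date)
query = """
SELECT c.full_name,
       c.email,
       SUM(p.amount)        AS total_spent,
       MAX(p.purchase_date) AS last_purchase
FROM customers AS c
JOIN purchases AS p ON p.customer_id = c.id
WHERE p.purchase_date BETWEEN '2023-10-01' AND '2023-12-31'
GROUP BY c.id, c.full_name, c.email
HAVING SUM(p.amount) > 500
ORDER BY total_spent DESC;
"""
```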

Input of the model

  • The input of the model is the data you feed into it
  • Use keywords that are relevant to the task you want the model to perform
  • When it comes to coding, you can provide the model with code snippets that it can use as a reference
  • Also pay attention to delimiters and separators, as they can help the model understand the structure of the input
  • For example, you can use comments to provide additional information to the model
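  • As a small illustration, the sketch below uses made-up delimiter tags and a comment to structure the input; the function and the tags are purely hypothetical

```python
# Hypothetical reference snippet; the <code> ... </code> tags act as delimiters
# so the model can tell where the snippet begins and ends
reference_code = "def load_data(path):\n    return open(path).read()\n"

prompt = (
    "Refactor the function between the <code> tags so it handles missing files gracefully.\n"
    "<code>\n" + reference_code + "</code>\n"
    "# Keep the function name and the return type unchanged."
)
print(prompt)
```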

Format

  • Finally, you can also be explicit about the format you want the output to be in
  • For example, you can specify that you want the output to be in a specific programming language, or you can ask the model to provide you with a code snippet
  • You can also ask the model to provide you with a list of steps to complete a task
  • This can be useful if you are working on a complex project or with different programming languages

How to get started with GitHub Copilot 🚀

How to get started with GitHub Copilot

  • We will look into Copilot in more detail in the next lecture, but I just want to make sure we all have it installed and ready to go 😉
  • First, it is a good idea to have VS Code installed on your computer
  • Then, you can install the GitHub Copilot extension from the VS Code marketplace
  • You will need to sign in with your GitHub account to use Copilot
  • Copilot works in the CLI as well, and you can install it using this guide. You need to install GitHub CLI first
  • Take some time to install and play around with Copilot before the next lecture 😃
  • Please let me know if you have any questions or issues!

Basic components

  • There are two main components to GitHub Copilot:
    • GitHub Copilot: This provides code suggestions and completions in the editor window
    • GitHub Copilot Chat: This provides a chat interface to Copilot, allowing you to ask questions and get code suggestions. It can also interact with code in the editor window. This is by far the most powerful feature of Copilot
  • Copilot not only offers code suggestions, but it can also help you with writing documentation, providing explanations for code you don’t know, and, more recently, retrieving code snippets and information from your whole workspace
  • In my view, it is the most convenient AI assistant for programming available today
  • It is also very easy to use

Autocompletion

  • If you start writing code in the editor, Copilot will suggest completions
  • You can accept these completions by pressing Tab or Enter
  • Alternatively, you can press Ctrl and the right arrow key to accept just the next word of the suggestion
  • This can be helpful if some, but not all, of the suggestion is appropriate
  • This autocompletion will work in a number of different contexts, including code cells in Jupyter Notebooks, or within a .py file.
  • The autocompletion is aware of the code around your cursor, and will suggest completions based on this context
  • It can use functions already in the code, and can infer a likely purpose of the next piece of code based on the code that has already been written
  • Sometimes, writing a comment will help guide Copilot to suggest relevant code to you
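  • For example, typing a descriptive comment like the one below often nudges Copilot towards a matching suggestion; the completion shown here is only illustrative, as suggestions vary

```python
import statistics

# Compute the mean and standard deviation of a list of numbers
def summary_stats(values):
    return statistics.mean(values), statistics.stdev(values)

print(summary_stats([2, 4, 4, 4, 5, 5, 7, 9]))  # (5, ~2.14)
```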

In-Editor Prompting

  • In an empty file or code cell, VS Code will display a greyed-out hint showing how to invoke Copilot

  • If you press Ctrl + I, a text box will appear into which you can write a prompt
  • Copilot will suggest new code, or changes to existing code based on this prompt
  • These may be in the form of one or more proposed changes
  • You may choose to accept or discard each of these changes by clicking the Accept or Discard buttons

In-Editor Prompting

  • You can also highlight an existing piece of code and press Ctrl + I to ask Copilot to suggest changes to that code
  • This can be useful if you have a piece of code that you think could be improved, or if you want to see alternative ways of writing the same code
  • When you do this, you can also click the button next to Accept and Discard to see which code Copilot suggests removing for each change

In-Editor Prompting

  • There are a few standard commands that are available for prompting Copilot to do something when you have a piece of code selected

  • You can access these in the Ctrl + I interface by typing a forward slash, then the name of the command. These are:

  • /doc: This will ask Copilot to generate documentation for the selected code. This will suggest changes in the editor

  • /explain: This will ask Copilot to explain the selected code. It will do this in the Copilot Chat extension

  • /fix: This will look for problems in the selected code and suggest fixes for them in the editor

  • /test: This will ask Copilot to generate tests for the selected code. This will suggest changes in the editor, which may include creating a new file for the tests

  • You can also find these options by right-clicking a highlighted piece of code, and going into the Copilot menu

  • We’ll look at using some of these tools later in the course

Copilot Chat

  • You can also chat to Copilot in a manner closer to that of a chatbot like ChatGPT
  • To do this, you can open the chat window by clicking on the chat icon in the activity bar at the left of the screen

  • You can type a message to Copilot in the window and it will respond, including suggesting code snippets where relevant
  • Copilot chat is limited to discussing programming
  • If Copilot produces a code snippet that you want to use, you can hover over the snippet and:
    • click the “Copy” button to copy it to the clipboard
    • click the “Insert at Cursor” button to insert it at the current cursor position in the editor
    • insert it into a new file or into the currently active terminal

Exercise

You can work on this exercise on your own time

  • Experiment with some of the ways of using Copilot that we have discussed in this lecture. Include the following activities:
    • Use the Ctrl + I interface to ask Copilot to generate a function from scratch.
    • Use the Ctrl + I interface to ask Copilot a question about an existing piece of code.
    • Use autocomplete to complete a function that you have started writing.
    • Ask Copilot Chat a general question about programming.
    • Ask Copilot Chat a question about a specific piece of code.
    • Insert some code produced by Copilot Chat into the code cell.

That’s all for today! 🎉

Next time we will learn what Copilot can do for us! 🤖

Have a great day! 😊