LLM Memory and Latency Profiling

Written on December 24, 2024

This post covers profiling an LLM, something I’ve wanted to do for a while.

Google Colab Machine Specs

I’m choosing to use Google Colab because it gives free access to an NVIDIA T4 GPU with minimal setup. Here’s a gist where I print information using Python libraries to learn more about the environment.
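The gist itself isn’t reproduced here, but a minimal version of that environment printout boils down to something like this (paths and commands assume a Linux machine with an NVIDIA driver installed):

import platform
import subprocess

# Standard-library view of the OS and CPU architecture
print(platform.platform())
print(platform.machine())

# Raw CPU details as exposed by the kernel
with open("/proc/cpuinfo") as f:
    print(f.read())

# GPU details via the nvidia-smi command-line tool
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)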

This site gives an overview of /proc/cpuinfo, where the data is being read from. The way the contents are generated varies by architecture: here is a code pointer for how Arm64 on Arch Linux reads the info, and here is how x86 on Arch Linux reads it. On x86 the information comes from the CPUID instruction (a special instruction for querying the processor for details), which is implemented by the CPU’s microcode (the layer inside the CPU that interprets machine instructions and orchestrates the hardware to execute them). The kernel executes CPUID at startup and stores the result in a structure like cpuinfo_x86. The kernel also reads Model-Specific Registers (MSRs), special registers that contain additional information.
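As a rough sketch, here’s how you might pull out just the fields discussed below from /proc/cpuinfo (field names are the standard x86 ones; other architectures expose different keys):

# Print only the cpuinfo fields we care about for this notebook
fields = ("model name", "cpu MHz", "siblings", "cpu cores", "cache size")
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.split(":")[0].strip() in fields:
            print(line.strip())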

This notebook has a single Xeon CPU with a clock speed of 2 GHz, meaning it can run roughly 2 billion CPU cycles per second. It has one physical core but 2 siblings, meaning the CPU can use hyperthreading (i.e. 1 physical core acts as 2 logical cores by sharing resources between two threads simultaneously). It has a cache size of 56 MB.

Using the nvidia-smi utility, we can see that we have a T4 attached with ~15 GB of VRAM.
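The same numbers can be sanity-checked from Python with PyTorch’s CUDA utilities (a quick sketch, assuming a CUDA-enabled PyTorch install):

import torch

# Query the attached GPU directly from PyTorch
print(torch.cuda.get_device_name(0))                                      # e.g. "Tesla T4"
print(torch.cuda.get_device_properties(0).total_memory / 1024**3, "GB")   # ~15 GB of VRAM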

1B Model Profiling

I picked Llama 3.2 1B, a smallish chat model that runs quickly, and used the Hugging Face Transformers pipeline API. Here is a blog post I wrote exploring the Hugging Face API.

from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
response = pipe("User: Hello! How are you?\nAssistant:", max_length=200)

At a high level, let’s go over what is happening in the code above. The weights are downloaded to disk and loaded into main memory by deserializing the model checkpoint into usable tensors, then transferred to GPU memory.

The input text is tokenized into tensors in RAM and transferred to the GPU. The model weights and input tensors are used in matrix multiplications and other operations on the GPU. Intermediate computations like activations are stored in GPU memory. After inference, the results are transferred back to RAM.
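For reference, here’s a rough sketch of those steps spelled out with the lower-level Transformers API rather than the pipeline helper (this mirrors, but isn’t exactly, what the pipeline does internally):

from transformers import AutoTokenizer, AutoModelForCausalLM

# Download/deserialize the checkpoint into tensors in RAM, then move the weights to GPU memory
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B").to("cuda")

# Tokenize in RAM and copy the input tensors to the GPU
inputs = tokenizer("User: Hello! How are you?\nAssistant:", return_tensors="pt").to("cuda")

# Matrix multiplications and intermediate activations happen on the GPU
outputs = model.generate(**inputs, max_length=200)

# Results are copied back to RAM for decoding
print(tokenizer.decode(outputs[0], skip_special_tokens=True))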

Time Profiling with cProfile

Here’s a snippet of my code measuring how long the pipeline took: about 13 seconds. Let’s see if we can dig deeper into exactly what is taking so long.

import time

start_time = time.time()

pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
response = pipe("User: Hello! How are you?\nAssistant:", max_length=200)
print(response[0]["generated_text"])

end_time = time.time()

print(f"Execution time: {end_time - start_time} seconds")

Let’s now use cProfile, which tells us how long specific functions run for and how many times they are called.

import cProfile
import pstats

def profile_code():
    pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
    response = pipe("User: Hello! How are you?\nAssistant:", max_length=200)
    return response[0]["generated_text"]

profiler = cProfile.Profile()
profiler.enable()
profile_code()
profiler.disable()

profiler.dump_stats('profile_data.prof')

# Memory usage
# mem_usage = memory_usage((profile_code,), interval=0.1)
# print(f"Memory usage over time: {mem_usage} MB")
# print(f"Peak memory usage: {max(mem_usage)} MB")

# Time Usage
stats = pstats.Stats('profile_data.prof')
stats.sort_stats('calls', 'time').print_stats(20) 
stats.sort_stats('time', 'calls').print_stats(20) 

Here is the output:

Wed Dec 25 05:34:06 2024    profile_data.prof

         794045 function calls (692599 primitive calls) in 11.434 seconds

   Ordered by: call count, internal time
   List reduced from 1171 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    97621    0.029    0.000    0.029    0.000 {method 'startswith' of 'str' objects}
    91037    0.103    0.000    0.103    0.000 /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1918(__getattr__)
    43642    0.052    0.000    0.052    0.000 {built-in method torch._C._get_tracing_state}
40446/189    0.160    0.000    3.933    0.021 /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1740(_call_impl)
40446/189    0.078    0.000    3.934    0.021 /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1732(_wrapped_call_impl)
35819/32473    0.010    0.000    0.018    0.000 {built-in method builtins.isinstance}
    21357    0.727    0.000    0.727    0.000 {built-in method torch._C._nn.linear}
    21357    0.062    0.000    0.816    0.000 /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:124(forward)
    17744    0.004    0.000    0.004    0.000 {method 'get' of 'dict' objects}
13486/13286    0.007    0.000    0.028    0.000 {built-in method builtins.len}
    13415    1.159    0.000    1.159    0.000 {method 'to' of 'torch._C.TensorBase' objects}
    12631    0.266    0.000    0.266    0.000 {built-in method torch.cat}
    12285    0.044    0.000    0.044    0.000 {method 'transpose' of 'torch._C.TensorBase' objects}
    12096    0.049    0.000    0.049    0.000 {method 'view' of 'torch._C.TensorBase' objects}
    11561    0.002    0.000    0.002    0.000 {method 'items' of 'dict' objects}
9003/1410    0.012    0.000    0.012    0.000 /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:2778(named_modules)
8474/8472    0.002    0.000    0.003    0.000 {built-in method builtins.getattr}
     8032    0.001    0.000    0.001    0.000 /usr/lib/python3.10/inspect.py:2687(name)
     8008    0.001    0.000    0.001    0.000 {method 'values' of 'collections.OrderedDict' objects}
     7949    0.009    0.000    0.017    0.000 {built-in method builtins.hasattr}


Wed Dec 25 05:34:06 2024    profile_data.prof

         794045 function calls (692599 primitive calls) in 11.434 seconds

   Ordered by: internal time, call count
   List reduced from 1171 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      146    3.174    0.022    3.174    0.022 {method 'copy_' of 'torch._C.TensorBase' objects}
    13415    1.159    0.000    1.159    0.000 {method 'to' of 'torch._C.TensorBase' objects}
        1    0.826    0.826    0.826    0.826 {built-in method from_file}
      190    0.729    0.004    0.731    0.004 /usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:2423(_has_unfinished_sequences)
    21357    0.727    0.000    0.727    0.000 {built-in method torch._C._nn.linear}
        2    0.474    0.237    0.474    0.237 {method 'read' of '_ssl._SSLSocket' objects}
     6237    0.363    0.000    0.724    0.000 /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:70(forward)
     3024    0.281    0.000    0.654    0.000 /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:203(apply_rotary_pos_emb)
    12631    0.266    0.000    0.266    0.000 {built-in method torch.cat}
     6048    0.209    0.000    0.354    0.000 /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:196(rotate_half)
    231/1    0.196    0.001    1.325    1.325 /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:897(_apply)
     3024    0.172    0.000    2.084    0.001 /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:491(forward)
     3024    0.169    0.000    3.665    0.001 /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:601(forward)
40446/189    0.160    0.000    3.933    0.021 /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1740(_call_impl)
     6049    0.158    0.000    0.158    0.000 {method 'reshape' of 'torch._C.TensorBase' objects}
     3024    0.127    0.000    0.127    0.000 {built-in method torch._C._nn.scaled_dot_product_attention}
     6237    0.126    0.000    0.126    0.000 {method 'mean' of 'torch._C.TensorBase' objects}
     6048    0.122    0.000    0.314    0.000 /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:246(repeat_kv)
     6237    0.120    0.000    0.120    0.000 {method 'pow' of 'torch._C.TensorBase' objects}
        1    0.117    0.117    3.757    3.757 /usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py:447(from_pretrained)



Looking at the data, there’s a single from_file call that takes a long time. This is most likely reading the downloaded weights file from disk into main memory. There are also two read operations on SSL sockets, which are most likely related to fetching files or metadata from the Hugging Face Hub.

There are also the torch copy_ and to methods on TensorBase, which likely correspond to moving weights and tensors onto the GPU. Unfortunately it’s not clear from this what, if anything, can be done to reduce the number of operations, especially from such a high-level API.

Analyzing Main Memory Usage

We can use this code to profile main memory usage:

from memory_profiler import memory_usage

# Query every 1 second
mem_usage = memory_usage((profile_code,), interval=1)
print(f"Memory usage over time: {mem_usage} MB")
print(f"Peak memory usage: {max(mem_usage)} MB")

Here’s the output:

Memory usage over time: [2913.14453125, 2913.14453125, 4477.9921875, 5004.890625, 7706.671875, 6478.078125, 6512.20703125, 4166.12109375, 3398.07421875, 3398.07421875, 3398.07421875, 3398.07421875, 3398.07421875, 2913.1953125] MB

Peak memory usage: 7706.671875 MB

Not super helpful. Ideally we could see which operations are using how much memory, but that isn’t visible here. In addition, this library only measures RAM, and the core operations are performed on the GPU.

Analyzing GPU Usage

Here’s the code using the PyTorch profiler:

from torch.profiler import profile, ProfilerActivity, record_function

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    profile_code()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
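The output below also includes a few “GPU memory allocated” lines. Those were presumably printed with something like torch.cuda.memory_allocated() (an assumption, since that snippet isn’t shown above):

import torch

# Presumably how the "GPU memory allocated" lines in the output below were produced
# (an assumption -- the exact notebook snippet isn't shown here)
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")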

Unfortunately I’m using the free version of Google Colab and wasn’t able to open a terminal to run TensorBoard. Here’s the text output:

Before pipeline creation - GPU memory allocated: 8.12 MB
Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Before inference - GPU memory allocated: 4722.39 MB
After inference - GPU memory allocated: 4722.39 MB
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                           aten::matmul         1.50%     220.802ms        10.67%        1.575s      73.119us       0.000us         0.00%        3.847s     178.532us         21546  
                                           aten::linear         0.89%     131.999ms        13.02%        1.921s      89.962us       0.000us         0.00%        3.846s     180.092us         21357  
                                               aten::mm         5.75%     848.979ms         7.94%        1.172s      54.869us        3.846s        68.81%        3.846s     180.092us         21357  
std::enable_if<!(false), void>::type internal::gemvx...         0.00%       0.000us         0.00%       0.000us       0.000us        3.809s        68.14%        3.809s     179.288us         21245  
                                            aten::copy_        40.93%        6.041s        50.54%        7.459s     858.901us        1.279s        22.88%        1.279s     147.296us          8684  
                                               aten::to         0.11%      16.601ms         9.00%        1.328s      60.602us       0.000us         0.00%        1.242s      56.645us         21919  
                                         aten::_to_copy         0.04%       5.468ms         8.89%        1.312s       2.201ms       0.000us         0.00%        1.242s       2.083ms           596  
                       Memcpy HtoD (Pageable -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us        1.241s        22.20%        1.241s       7.432ms           167  
                     aten::scaled_dot_product_attention         0.41%      60.214ms         2.29%     338.007ms     111.775us       0.000us         0.00%     111.852ms      36.988us          3024  
          aten::_scaled_dot_product_efficient_attention         0.29%      43.443ms         1.88%     277.793ms      91.863us       0.000us         0.00%     111.852ms      36.988us          3024  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 14.759s
Self CUDA time total: 5.590s

GPUs, Flame Graphs, and eBPF

From the output above, it’s hard to determine where in the program a long call is happening. There is no unified place to delve deeper into the call stack.

I remember coming across eBPF a while ago as a way of digging really deep into the stack for CPU profiling, and flame graphs as a way of visualizing the data. I was excited to come across this AI Flame Graphs blog post, but unfortunately the tooling isn’t publicly available at the moment.

Optimizations

In terms of optimizations, improving throughput across many requests will be the more interesting direction. Some ideas:

  • caching responses to handle many similar requests
  • async processing for longer-running requests
  • concurrently handling multiple requests
  • batch inference (see the sketch below)
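As a quick illustration of the last item, the Transformers pipeline can already batch prompts together. A minimal sketch, reusing the pipe from earlier (the prompts are made up, and the throughput gain would need to be measured rather than assumed):

# Batching needs a pad token; Llama tokenizers can reuse the EOS token for this
pipe.tokenizer.pad_token_id = pipe.tokenizer.eos_token_id

prompts = [
    "User: Hello! How are you?\nAssistant:",
    "User: What is profiling?\nAssistant:",
]

# Process both prompts together rather than one at a time
responses = pipe(prompts, max_length=200, batch_size=2)
for r in responses:
    print(r[0]["generated_text"])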

As for next steps, I’ll be looking into implementing an LLM “from scratch” using PyTorch and then trying to actually measure and compare optimizations. This will allow me to learn more about the lower-level concepts.

Further areas to explore