{"id":1336,"date":"2021-12-08T07:19:44","date_gmt":"2021-12-08T07:19:44","guid":{"rendered":"https:\/\/toshareproject.it\/artmakerblog\/?p=1336"},"modified":"2021-12-08T07:19:44","modified_gmt":"2021-12-08T07:19:44","slug":"karpathy-gpt-model-on-github","status":"publish","type":"post","link":"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/","title":{"rendered":"Karpathy GPT model on GitHub"},"content":{"rendered":"<p>*It&#8217;s a neural net architecture, but it&#8217;s so small and compact.  I can put all its python code into one blog post.  So, why not?  At the Share Artmaker Blog, we&#8217;re here to serve!<\/p>\n<p>Andrej Karpathy is the director of AI at Tesla, leading the Autopilot Vision team. Previously OpenAI, CS231n, PhD @ Stanford. <\/p>\n<p><a href=\"https:\/\/github.com\/karpathy\/minGPT\/blob\/master\/mingpt\/model.py\">https:\/\/github.com\/karpathy\/minGPT\/blob\/master\/mingpt\/model.py<\/a><\/p>\n<p>&#8220;&#8221;&#8221;<br \/>\nGPT model:<br \/>\n&#8211; the initial stem consists of a combination of token encoding and a positional encoding<br \/>\n&#8211; the meat of it is a uniform sequence of Transformer blocks<br \/>\n    &#8211; each Transformer is a sequential combination of a 1-hidden-layer MLP block and a self-attention block<br \/>\n    &#8211; all blocks feed into a central residual pathway similar to resnets<br \/>\n&#8211; the final decoder is a linear projection into a vanilla Softmax classifier<br \/>\n&#8220;&#8221;&#8221;<\/p>\n<p>import math<br \/>\nimport logging<\/p>\n<p>import torch<br \/>\nimport torch.nn as nn<br \/>\nfrom torch.nn import functional as F<\/p>\n<p>logger = logging.getLogger(__name__)<\/p>\n<p>class GPTConfig:<br \/>\n    &#8220;&#8221;&#8221; base GPT config, params common to all GPT versions &#8220;&#8221;&#8221;<br \/>\n    embd_pdrop = 0.1<br \/>\n    resid_pdrop = 0.1<br \/>\n    attn_pdrop = 0.1<\/p>\n<p>    def __init__(self, vocab_size, block_size, **kwargs):<br \/>\n        
self.vocab_size = vocab_size<br \/>\n        self.block_size = block_size<br \/>\n        for k,v in kwargs.items():<br \/>\n            setattr(self, k, v)<\/p>\n<p>class GPT1Config(GPTConfig):<br \/>\n    &#8220;&#8221;&#8221; GPT-1 like network roughly 125M params &#8220;&#8221;&#8221;<br \/>\n    n_layer = 12<br \/>\n    n_head = 12<br \/>\n    n_embd = 768<\/p>\n<p>class CausalSelfAttention(nn.Module):<br \/>\n    &#8220;&#8221;&#8221;<br \/>\n    A vanilla multi-head masked self-attention layer with a projection at the end.<br \/>\n    It is possible to use torch.nn.MultiheadAttention here but I am including an<br \/>\n    explicit implementation here to show that there is nothing too scary here.<br \/>\n    &#8220;&#8221;&#8221;<\/p>\n<p>    def __init__(self, config):<br \/>\n        super().__init__()<br \/>\n        assert config.n_embd % config.n_head == 0<br \/>\n        # key, query, value projections for all heads<br \/>\n        self.key = nn.Linear(config.n_embd, config.n_embd)<br \/>\n        self.query = nn.Linear(config.n_embd, config.n_embd)<br \/>\n        self.value = nn.Linear(config.n_embd, config.n_embd)<br \/>\n        # regularization<br \/>\n        self.attn_drop = nn.Dropout(config.attn_pdrop)<br \/>\n        self.resid_drop = nn.Dropout(config.resid_pdrop)<br \/>\n        # output projection<br \/>\n        self.proj = nn.Linear(config.n_embd, config.n_embd)<br \/>\n        # causal mask to ensure that attention is only applied to the left in the input sequence<br \/>\n        self.register_buffer(&#8220;mask&#8221;, torch.tril(torch.ones(config.block_size, config.block_size))<br \/>\n                                     .view(1, 1, config.block_size, config.block_size))<br \/>\n        self.n_head = config.n_head<\/p>\n<p>    def forward(self, x, layer_past=None):<br \/>\n        B, T, C = x.size()<\/p>\n<p>        # calculate query, key, values for all heads in batch and move head forward to be the batch dim<br \/>\n        k = 
self.key(x).view(B, T, self.n_head, C \/\/ self.n_head).transpose(1, 2) # (B, nh, T, hs)<br \/>\n        q = self.query(x).view(B, T, self.n_head, C \/\/ self.n_head).transpose(1, 2) # (B, nh, T, hs)<br \/>\n        v = self.value(x).view(B, T, self.n_head, C \/\/ self.n_head).transpose(1, 2) # (B, nh, T, hs)<\/p>\n<p>        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)<br \/>\n        att = (q @ k.transpose(-2, -1)) * (1.0 \/ math.sqrt(k.size(-1)))<br \/>\n        att = att.masked_fill(self.mask[:,:,:T,:T] == 0, float(&#8216;-inf&#8217;))<br \/>\n        att = F.softmax(att, dim=-1)<br \/>\n        att = self.attn_drop(att)<br \/>\n        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)<br \/>\n        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side<\/p>\n<p>        # output projection<br \/>\n        y = self.resid_drop(self.proj(y))<br \/>\n        return y<\/p>\n<p>class Block(nn.Module):<br \/>\n    &#8220;&#8221;&#8221; an unassuming Transformer block &#8220;&#8221;&#8221;<\/p>\n<p>    def __init__(self, config):<br \/>\n        super().__init__()<br \/>\n        self.ln1 = nn.LayerNorm(config.n_embd)<br \/>\n        self.ln2 = nn.LayerNorm(config.n_embd)<br \/>\n        self.attn = CausalSelfAttention(config)<br \/>\n        self.mlp = nn.Sequential(<br \/>\n            nn.Linear(config.n_embd, 4 * config.n_embd),<br \/>\n            nn.GELU(),<br \/>\n            nn.Linear(4 * config.n_embd, config.n_embd),<br \/>\n            nn.Dropout(config.resid_pdrop),<br \/>\n        )<\/p>\n<p>    def forward(self, x):<br \/>\n        x = x + self.attn(self.ln1(x))<br \/>\n        x = x + self.mlp(self.ln2(x))<br \/>\n        return x<\/p>\n<p>class GPT(nn.Module):<br \/>\n    &#8220;&#8221;&#8221;  the full GPT language model, with a context size of block_size &#8220;&#8221;&#8221;<\/p>\n<p>    def __init__(self, config):<br \/>\n        
super().__init__()<\/p>\n<p>        # input embedding stem<br \/>\n        self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)<br \/>\n        self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd))<br \/>\n        self.drop = nn.Dropout(config.embd_pdrop)<br \/>\n        # transformer<br \/>\n        self.blocks = nn.Sequential(*[Block(config) for _ in range(config.n_layer)])<br \/>\n        # decoder head<br \/>\n        self.ln_f = nn.LayerNorm(config.n_embd)<br \/>\n        self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)<\/p>\n<p>        self.block_size = config.block_size<br \/>\n        self.apply(self._init_weights)<\/p>\n<p>        logger.info(&#8220;number of parameters: %e&#8221;, sum(p.numel() for p in self.parameters()))<\/p>\n<p>    def get_block_size(self):<br \/>\n        return self.block_size<\/p>\n<p>    def _init_weights(self, module):<br \/>\n        if isinstance(module, (nn.Linear, nn.Embedding)):<br \/>\n            module.weight.data.normal_(mean=0.0, std=0.02)<br \/>\n            if isinstance(module, nn.Linear) and module.bias is not None:<br \/>\n                module.bias.data.zero_()<br \/>\n        elif isinstance(module, nn.LayerNorm):<br \/>\n            module.bias.data.zero_()<br \/>\n            module.weight.data.fill_(1.0)<\/p>\n<p>    def configure_optimizers(self, train_config):<br \/>\n        &#8220;&#8221;&#8221;<br \/>\n        This long function is unfortunately doing something very simple and is being very defensive:<br \/>\n        We are separating out all parameters of the model into two buckets: those that will experience<br \/>\n        weight decay for regularization and those that won&#8217;t (biases, and layernorm\/embedding weights).<br \/>\n        We are then returning the PyTorch optimizer object.<br \/>\n        &#8220;&#8221;&#8221;<\/p>\n<p>        # separate out all parameters to those that will and won&#8217;t experience regularizing weight decay<br 
\/>\n        decay = set()<br \/>\n        no_decay = set()<br \/>\n        whitelist_weight_modules = (torch.nn.Linear, )<br \/>\n        blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)<br \/>\n        for mn, m in self.named_modules():<br \/>\n            for pn, p in m.named_parameters():<br \/>\n                fpn = &#8216;%s.%s&#8217; % (mn, pn) if mn else pn # full param name<\/p>\n<p>                if pn.endswith(&#8216;bias&#8217;):<br \/>\n                    # all biases will not be decayed<br \/>\n                    no_decay.add(fpn)<br \/>\n                elif pn.endswith(&#8216;weight&#8217;) and isinstance(m, whitelist_weight_modules):<br \/>\n                    # weights of whitelist modules will be weight decayed<br \/>\n                    decay.add(fpn)<br \/>\n                elif pn.endswith(&#8216;weight&#8217;) and isinstance(m, blacklist_weight_modules):<br \/>\n                    # weights of blacklist modules will NOT be weight decayed<br \/>\n                    no_decay.add(fpn)<\/p>\n<p>        # special case the position embedding parameter in the root GPT module as not decayed<br \/>\n        no_decay.add(&#8216;pos_emb&#8217;)<\/p>\n<p>        # validate that we considered every parameter<br \/>\n        param_dict = {pn: p for pn, p in self.named_parameters()}<br \/>\n        inter_params = decay &#038; no_decay<br \/>\n        union_params = decay | no_decay<br \/>\n        assert len(inter_params) == 0, &#8220;parameters %s made it into both decay\/no_decay sets!&#8221; % (str(inter_params), )<br \/>\n        assert len(param_dict.keys() &#8211; union_params) == 0, &#8220;parameters %s were not separated into either decay\/no_decay set!&#8221; \\<br \/>\n                                                    % (str(param_dict.keys() &#8211; union_params), )<\/p>\n<p>        # create the pytorch optimizer object<br \/>\n        optim_groups = [<br \/>\n            {&#8220;params&#8221;: [param_dict[pn] for pn 
in sorted(list(decay))], &#8220;weight_decay&#8221;: train_config.weight_decay},<br \/>\n            {&#8220;params&#8221;: [param_dict[pn] for pn in sorted(list(no_decay))], &#8220;weight_decay&#8221;: 0.0},<br \/>\n        ]<br \/>\n        optimizer = torch.optim.AdamW(optim_groups, lr=train_config.learning_rate, betas=train_config.betas)<br \/>\n        return optimizer<\/p>\n<p>    def forward(self, idx, targets=None):<br \/>\n        b, t = idx.size()<br \/>\n        assert t &lt;= self.block_size, \"Cannot forward, model block size is exhausted.\"<\/p>\n<p>        # forward the GPT model<br \/>\n        token_embeddings = self.tok_emb(idx) # each index maps to a (learnable) vector<br \/>\n        position_embeddings = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector<br \/>\n        x = self.drop(token_embeddings + position_embeddings)<br \/>\n        x = self.blocks(x)<br \/>\n        x = self.ln_f(x)<br \/>\n        logits = self.head(x)<\/p>\n<p>        # if we are given some desired targets also calculate the loss<br \/>\n        loss = None<br \/>\n        if targets is not None:<br \/>\n            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))<\/p>\n<p>        return logits, loss<\/p>\n","protected":false},"excerpt":{"rendered":"<p>*It&#8217;s a neural net architecture, but it&#8217;s so small and compact. I can put all its python code into one blog post. So, why not? At the Share Artmaker Blog, we&#8217;re here to serve! Andrej Karpathy is the director of AI at Tesla, leading the Autopilot Vision team. Previously OpenAI, CS231n, PhD @ Stanford. 
https:\/\/github.com\/karpathy\/minGPT\/blob\/master\/mingpt\/model.py [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1336","post","type-post","status-publish","format-standard","hentry","category-uncategorised"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v17.0 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Karpathy GPT model on GitHub | Artmaker Blog<\/title>\n<meta name=\"description\" content=\"Karpathy GPT model on GitHub | Artmaker Blog\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Karpathy GPT model on GitHub | Artmaker Blog\" \/>\n<meta property=\"og:description\" content=\"Karpathy GPT model on GitHub | Artmaker Blog\" \/>\n<meta property=\"og:url\" content=\"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/\" \/>\n<meta property=\"og:site_name\" content=\"Artmaker Blog\" \/>\n<meta property=\"article:published_time\" content=\"2021-12-08T07:19:44+00:00\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Bruce Sterling\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/#website\",\"url\":\"https:\/\/toshareproject.it\/artmakerblog\/\",\"name\":\"Artmaker Blog\",\"description\":\"on Toshareproject.it - curated by Bruce Sterling\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/toshareproject.it\/artmakerblog\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-GB\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/#webpage\",\"url\":\"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/\",\"name\":\"Karpathy GPT model on GitHub | Artmaker Blog\",\"isPartOf\":{\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/#website\"},\"datePublished\":\"2021-12-08T07:19:44+00:00\",\"dateModified\":\"2021-12-08T07:19:44+00:00\",\"author\":{\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/#\/schema\/person\/6f20726ed2761431f3e0ff4e096c3085\"},\"description\":\"Karpathy GPT model on GitHub | Artmaker Blog\",\"breadcrumb\":{\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/#breadcrumb\"},\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/toshareproject.it\/artmakerblog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Karpathy GPT model on GitHub\"}]},{\"@type\":\"Person\",\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/#\/schema\/person\/6f20726ed2761431f3e0ff4e096c3085\",\"name\":\"Bruce 
Sterling\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/toshareproject.it\/artmakerblog\/#personlogo\",\"inLanguage\":\"en-GB\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/c390e8ed4db57a34278dcf667f928a643cf769a865c8a8632dcd310412bb9a99?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/c390e8ed4db57a34278dcf667f928a643cf769a865c8a8632dcd310412bb9a99?s=96&d=mm&r=g\",\"caption\":\"Bruce Sterling\"},\"description\":\"Art director at Share Festival, author and journalist\",\"sameAs\":[\"http:\/\/toshareproject.it\/tomorrowart\"],\"url\":\"https:\/\/toshareproject.it\/artmakerblog\/author\/brucesterling\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Karpathy GPT model on GitHub | Artmaker Blog","description":"Karpathy GPT model on GitHub | Artmaker Blog","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/","og_locale":"en_GB","og_type":"article","og_title":"Karpathy GPT model on GitHub | Artmaker Blog","og_description":"Karpathy GPT model on GitHub | Artmaker Blog","og_url":"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/","og_site_name":"Artmaker Blog","article_published_time":"2021-12-08T07:19:44+00:00","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Bruce Sterling","Estimated reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebSite","@id":"https:\/\/toshareproject.it\/artmakerblog\/#website","url":"https:\/\/toshareproject.it\/artmakerblog\/","name":"Artmaker Blog","description":"on Toshareproject.it - curated by Bruce Sterling","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/toshareproject.it\/artmakerblog\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"en-GB"},{"@type":"WebPage","@id":"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/#webpage","url":"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/","name":"Karpathy GPT model on GitHub | Artmaker Blog","isPartOf":{"@id":"https:\/\/toshareproject.it\/artmakerblog\/#website"},"datePublished":"2021-12-08T07:19:44+00:00","dateModified":"2021-12-08T07:19:44+00:00","author":{"@id":"https:\/\/toshareproject.it\/artmakerblog\/#\/schema\/person\/6f20726ed2761431f3e0ff4e096c3085"},"description":"Karpathy GPT model on GitHub | Artmaker Blog","breadcrumb":{"@id":"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/#breadcrumb"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/toshareproject.it\/artmakerblog\/karpathy-gpt-model-on-github\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/toshareproject.it\/artmakerblog\/"},{"@type":"ListItem","position":2,"name":"Karpathy GPT model on GitHub"}]},{"@type":"Person","@id":"https:\/\/toshareproject.it\/artmakerblog\/#\/schema\/person\/6f20726ed2761431f3e0ff4e096c3085","name":"Bruce Sterling","image":{"@type":"ImageObject","@id":"https:\/\/toshareproject.it\/artmakerblog\/#personlogo","inLanguage":"en-GB","url":"https:\/\/secure.gravatar.com\/avatar\/c390e8ed4db57a34278dcf667f928a643cf769a865c8a8632dcd310412bb9a99?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/c390e8ed4db57a34278dcf667f928a643cf769a865c8a8632dcd310412bb9a99?s=96&d=mm&r=g","caption":"Bruce Sterling"},"description":"Art director at Share Festival, author and 
journalist","sameAs":["http:\/\/toshareproject.it\/tomorrowart"],"url":"https:\/\/toshareproject.it\/artmakerblog\/author\/brucesterling\/"}]}},"_links":{"self":[{"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/posts\/1336","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/comments?post=1336"}],"version-history":[{"count":1,"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/posts\/1336\/revisions"}],"predecessor-version":[{"id":1337,"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/posts\/1336\/revisions\/1337"}],"wp:attachment":[{"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/media?parent=1336"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/categories?post=1336"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/toshareproject.it\/artmakerblog\/wp-json\/wp\/v2\/tags?post=1336"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}