A unique approach to knowledge distillation

Photo by Adam Winger on Unsplash

The field of natural language processing has been revolutionized by the advent of large pretrained models like BERT and GPT-3. These models capture incredible amounts of information from the vast quantities of text they’ve been trained on, and they use that information to reach state-of-the-art performance and continually improve on a wide variety of tasks like classification, summarization and entailment.

One reason for these models’ great performance is their size. BERT-base has 110 million parameters, BERT-large has 340 million, GPT-2 has 1.5 billion and GPT-3 has a whopping 175 billion. As…
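For context, knowledge distillation typically trains a small student model to mimic a large teacher by matching the teacher’s temperature-softened output distribution as well as the hard labels. Below is a minimal sketch of the classic soft-target loss from Hinton et al. (2015), written with NumPy for illustration; it is generic background, not the specific approach this article goes on to describe.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperatures give softer distributions."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, true_label, T=2.0, alpha=0.5):
    """Soft-target distillation loss: a weighted sum of
    (a) cross-entropy between the student and the hard label, and
    (b) KL divergence between temperature-softened teacher and student outputs,
    scaled by T^2 to keep gradient magnitudes comparable across temperatures."""
    p_student = softmax(student_logits)
    hard_loss = -np.log(p_student[true_label] + 1e-12)

    pt = softmax(teacher_logits, T)  # soft targets from the teacher
    ps = softmax(student_logits, T)  # soft predictions from the student
    soft_loss = np.sum(pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12))) * T * T

    return alpha * hard_loss + (1 - alpha) * soft_loss
```

During training, this scalar would be minimized with respect to the student’s parameters; the hyperparameters `T` and `alpha` here are illustrative defaults, not values from the article.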

A suite of models that fixes inaccuracies in summaries of long texts

Photo by madeleine ragsdale on Unsplash

When I’m trying to decide whether to read a book, the first thing I look at is the summary on the back. The same goes for determining whether a scientific paper is relevant to my work, whether an article might be interesting, or anything else that involves reading a lot of text. For this to work, the summary needs to be concise and accurate.

Machine learning researchers have focused on summarizing long pieces of text and have developed models that can generate grammatically correct, coherent summaries. These summarization models fall into 2 broad…

A model that understands a video and its subtitles to help you speed through it

Photo by Joey Nicotra on Unsplash

Have you ever had to watch 15 lectures’ worth of material the night before an exam? Or had an hour to submit a report about a 2-hour movie that you haven’t watched? I know I have. Luckily, my lecture videos were marked with where content started and ended, so I could just skim through them. But what if those markers weren’t there? Fear not: HERO is here to save you!

“What is HERO and how will it save me?”, you might ask. Your question will be answered but first, let me give you some context. If you’ve read some of…

A technique that promotes a deeper understanding between different languages

Photo by GRÆS Magazine on Unsplash

Based on sources from across the internet, somewhere between one in six and one in seven people in the world speak English. Despite this underwhelming minority, the vast majority of natural language understanding and generation datasets, like the Stanford Question Answering Dataset (SQuAD) and the GLUE benchmark, as well as the large-scale pretrained models like BERT, RoBERTa and ALBERT that have revolutionized the NLP world, are based solely on English.

However, other languages have recently received more attention with the creation of multilingual large-scale pretrained models…

Use BERT to summarize all those long documents that you don’t want to read

Photo by Brady Bellini on Unsplash

With the world so connected now, we’re constantly bombarded with information from a variety of sources, and it can be overwhelming. Social media has also transformed the way information is presented to us, with apps like Instagram and Pinterest favoring visual content over text. This makes reading large swathes of text less appealing when we have to do it.

But what if those large swathes of text could be converted into a summary of just the key points? In comes machine learning to the rescue again.
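To give a flavor of what a summarizer does, here is a toy extractive baseline: score each sentence by the average corpus frequency of its words, then keep the top-scoring sentences in their original order. This frequency heuristic is purely illustrative and is not the BERT-based approach this article describes.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Toy extractive summarizer: rank sentences by the average document-wide
    frequency of their words, then return the top sentences in original order.
    (Illustrative baseline only -- not a BERT-based model.)"""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    scores = []
    for i, s in enumerate(sentences):
        tokens = re.findall(r'\w+', s.lower())
        score = sum(freq[t] for t in tokens) / max(len(tokens), 1)
        scores.append((score, i))
    # pick the highest-scoring sentence indices, then restore reading order
    keep = sorted(i for _, i in sorted(scores, reverse=True)[:num_sentences])
    return ' '.join(sentences[i] for i in keep)
```

Neural extractive models replace the frequency score with a learned sentence representation, but the select-and-concatenate skeleton is the same.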

Summarization has been a task of interest to the…

Generate better stories, essays, answers and more using BERT’s knowledge

Photo by MILKOVÍ on Unsplash

The field of natural language processing is now in an age where large-scale pretrained models are the first thing to try for almost any new task. Models like BERT, RoBERTa and ALBERT are so large and have been trained on so much data that they can generalize their pretrained knowledge to almost any downstream task you use them for. But that’s all they can do: understand. …

The first multimodal adversarial training technique for image and text models

Image by Gerd Altmann from Pixabay

Building large-scale pretrained models for vision-and-language multimodal learning has been a booming area of machine learning research recently. Models like UNITER, ViLBERT and LXMERT are trained on millions of image-text pairs with different objective functions, and they learn an effective generic joint representation of images and text.

These representations are then finetuned on a specific task to achieve state-of-the-art performance on a myriad of image-and-language tasks like image QA, image-text retrieval, image grounding, image inference and more. When finetuning on a specific task, the amount of data available is only a fraction…
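As background on what “adversarial training” means here: the core move is to perturb an input (or its embedding) in the direction that most increases the loss, then train on both the clean and perturbed versions so the model becomes robust to small shifts. Below is a generic FGSM-style sketch on a simple logistic model; it is a toy in feature space, not the multimodal procedure from the paper, and all names in it are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(x, w, y):
    """Binary cross-entropy for p(y=1|x) = sigmoid(w.x)."""
    p = sigmoid(w @ x)
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def adversarial_perturbation(x, w, y, epsilon=0.1):
    """FGSM-style perturbation: step the input in the sign of the loss
    gradient w.r.t. x, i.e. the direction that increases the loss most.
    For logistic loss, d(loss)/dx = (p - y) * w in closed form."""
    p = sigmoid(w @ x)
    grad_x = (p - y) * w
    return x + epsilon * np.sign(grad_x)
```

An adversarial training step would then minimize `logistic_loss(x, w, y) + logistic_loss(adversarial_perturbation(x, w, y), w, y)` with respect to `w`; methods like the one in this article apply the same idea to image and text embeddings rather than raw features.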

Create a masterpiece without a pencil or paintbrush

Photo by Adli Wahid on Unsplash

Picture this: one day, you dream up an artwork so awe-inspiring, so filled with vibrant colors and beauty, that it evokes primal feelings in beholders and can absorb them for hours. You wake up and jump out of bed filled with excitement, ready to bring this magical piece to life. Days later, you realize you’re no Picasso, and your execution could never do justice to what you imagined. You leave your canvas frustrated at your inability to bring magic to the world. But what if this story could go another way? What if you didn’t have to…

A new task and dataset that requires multi-modal understanding of videos and text

Photo by Cameron Mourot on Unsplash

Our consumption of visual media has grown drastically over the past decade. In fact, between 2013 and 2018, video consumption grew by a stunning 32% every year, and we’re showing no signs of slowing down. Social media platforms like TikTok, Pinterest and Instagram are filled almost exclusively with images and videos, and businesses are increasingly switching to visual platforms like Power BI, Tableau and SAS Visual Analytics. With such a large share of our information coming from visual cues, the natural next step for machine learning researchers is to build models that can understand and analyze it.

Learning a joint representation of image and text that everything can use

Image by Patricia Hébert from Pixabay

Multimodal learning is omnipresent in our lives. Humans absorb content in different ways: through pictures (visual), text, or spoken explanations (audio), to name a few. Each of these sources of knowledge is known as a mode. In fact, we often learn through a combination of these modes, giving everyone a unique learning experience. The McGurk effect, where seeing a person mouth one sound (ga-ga) while hearing different audio (ba-ba) causes the observer to perceive a third sound (da-da), is a prime example of how different modalities interact with one another. …

Rohit Pillai

I’m an engineer at Microsoft Dynamics 365 AI Research, and I’ll be posting about our new NLP, CV and multimodal research. Check out https://medium.com/@rohit.rameshp
