Machine learning that can visualize and imagine outcomes of descriptions (like humans!) 💡🤔
Can language models understand situations well enough to imagine them and convey those projections back to us? That is the question this project set out to answer, and the results are striking. As humans, we are remarkably good at combining scattered pieces of information to envision circumstances in the real world. We can dream, imagine, plan, and foresee even the most unlikely things. For example, can you imagine a red car with wings? You probably pictured exactly that, a red car with wings, without ever having seen one before. Can we get a computer to do the same thing? It turns out that with some changes to existing models and data collection techniques, we can achieve remarkable results (like the real interaction below).
PROMPT
Imagine the future.
RESPONSE
In the future I envisioned, I saw big buildings and incredible technology, like flying cars.
To accomplish this life-like ability, the entire text generation process needed to be redesigned. Humans draw on previous experiences and images to form generalizations. In the red car with wings example, you have seen a red car before, and you have seen wings before, but probably never together. Despite this, imagining the new combination is easy for us. For the first time, this new NLP model can interpret both text and image inputs, allowing it to truly visualize and predict situations and provide realistic insights.
The Method
STEP 1
Take the initial prompt and gather text-based information from the web
- We start with a prompt: "Imagine the future."
- The input is parsed into key words: <the>, <future>.
- Key trends are determined from web research: Cities, Cars, Tech, Health.
- A text-based output is generated, e.g. "The future is cool."

STEP 2
Take the initial prompt and gather image information from the web
- We start with the same prompt: "Imagine the future."
- The web is searched for matching images (images of "the future").
- These images are analyzed to find trends: <shiny buildings>, <sunny>, <flying planes>.
- An image-based output is made.
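As a rough illustration only (the project's actual implementation is not shown on this page), the two gathering pipelines could look something like the Python sketch below. It parses the prompt into key words with a simple filter and scores one scraped image against a hand-picked list of candidate trend labels using an off-the-shelf CLIP model from the Hugging Face transformers library; the model choice, the label list, and the image path are all assumptions made for the example.

```python
# Minimal sketch of Steps 1 and 2 (illustrative only, not the project's code).
# Assumes: pip install torch transformers pillow
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Words that carry the instruction rather than the topic (hypothetical filter).
INSTRUCTION_WORDS = {"imagine", "picture", "envision", "describe"}

def parse_key_words(prompt: str) -> list[str]:
    """Step 1: split the prompt into lower-cased key words, dropping instruction verbs."""
    words = [w.strip(".,!?").lower() for w in prompt.split()]
    return [w for w in words if w and w not in INSTRUCTION_WORDS]

# Candidate trend labels that web research might surface (hypothetical list).
TREND_LABELS = ["shiny buildings", "sunny skies", "flying planes", "crowded streets"]

def tag_image_trends(image_path: str, labels: list[str], top_k: int = 2) -> list[str]:
    """Step 2: zero-shot score one scraped image against candidate trend labels with CLIP."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image = Image.open(image_path)
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    ranked = sorted(zip(labels, probs.tolist()), key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in ranked[:top_k]]

if __name__ == "__main__":
    print(parse_key_words("Imagine the future"))  # -> ['the', 'future']
    # "scraped/future_city.jpg" stands in for one web-scraped image.
    print(tag_image_trends("scraped/future_city.jpg", TREND_LABELS))
```

In the real system the candidate trend labels would presumably come from the web research in Step 1 rather than a fixed list.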
STEP 3
Combine all inputs with a trained model to construct a final answer
- The prompt and both of the outputs above are combined.
- These inputs are fed into advanced NLP models.
- The models are fine-tuned to respond appropriately ("In what I envisioned...").
- A final response is generated and shown.
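Step 3 can be sketched in a similar spirit: fold the prompt and both sets of trends into a single conditioning string and hand it to a generative language model. The snippet below uses GPT-2 from the transformers library purely as a stand-in; the project's actual fine-tuned model, prompt template, and decoding settings are not documented on this page.

```python
# Minimal sketch of Step 3 (illustrative only): combine the prompt with the
# text- and image-derived trends, then let a generative model write the answer.
# Assumes: pip install torch transformers
from transformers import pipeline

def build_context(prompt: str, text_trends: list[str], image_trends: list[str]) -> str:
    """Fold the original prompt and both trend lists into one conditioning string."""
    return (
        f"Prompt: {prompt}\n"
        f"Trends from web text: {', '.join(text_trends)}\n"
        f"Trends from web images: {', '.join(image_trends)}\n"
        f"In what I envisioned,"
    )

if __name__ == "__main__":
    context = build_context(
        "Imagine the future",
        text_trends=["cities", "cars", "tech", "health"],
        image_trends=["shiny buildings", "flying planes"],
    )
    # GPT-2 is a stand-in; the real system would use a model fine-tuned for this task.
    generator = pipeline("text-generation", model="gpt2")
    result = generator(context, max_new_tokens=40, do_sample=True)[0]["generated_text"]
    print(result)
```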
By combining text and images, the model gains a much richer understanding of the real-world implications of any given prompt, which in turn allows for a far more nuanced output. All examples of interaction with this model on this page (prompt and response) are real, and the supporting images are a sample of real web-scraped images processed by the model's architecture. To conclude, I wanted to return to the first example this document started with and see how the model responds.
PROMPT
Imagine a red car with wings.