September 25, 2024 ⏱️ 3 min
By George S. (RnD – Mobile Group)
Gemini is the result of a joint effort by several Google teams in their quest to harness the power of AI so that it can benefit users in incredible ways.
It is their most general and capable AI model yet, with the ability to process and combine information in a variety of formats: text, code, audio, images and video. What differentiates Gemini from the standard multimodal model approach is that instead of training separate components for different modalities and then combining the results, it is designed to be natively multimodal. This allows for better understanding and reasoning about the various input formats.
Gemini models
Gemini comes in different model variants, each tailored to a different set of needs:
- Ultra: the largest model, suitable for highly complex tasks.
- Pro: offers a staggering 2-million-token context window, giving it the ability to process long documents, hours of video or audio, or a code base spanning thousands of lines.
- Flash: optimized for speed and efficiency, with a context window of up to 1 million tokens and a first-token latency of under 1 second for the vast majority of use cases.
- Nano: built for on-device tasks and does not require a network connection. It is currently included in Pixel phones, offering the ability to extract summaries from audio recordings and to transform written text into different styles using Magic Compose.
More details are available at the following link; the sketch below shows how the choice of variant translates into the model name passed to the API.
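As a rough illustration, the snippet below uses the Kotlin client SDK for Android with the 1.5-generation model names that were current at the time of writing; the BuildConfig API key field is an assumption about project setup, not something the SDK prescribes. Switching between Pro and Flash is simply a different model-name string.

```kotlin
import com.google.ai.client.generativeai.GenerativeModel

// Assumed: GEMINI_API_KEY is injected as a BuildConfig field via Gradle, never hard-coded.

// Pro: large context window, suited to long documents or big code bases.
val proModel = GenerativeModel(
    modelName = "gemini-1.5-pro",
    apiKey = BuildConfig.GEMINI_API_KEY
)

// Flash: the speed- and efficiency-optimized variant.
val flashModel = GenerativeModel(
    modelName = "gemini-1.5-flash",
    apiKey = BuildConfig.GEMINI_API_KEY
)
```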
Use cases
- Greenhouses: plant identification based on image recognition, personalized diagnosis and care guides (see the image-prompt sketch after this list).
- Manufacturing: real-time inventory management, predictive maintenance.
- Education: peer-to-peer tutoring, language learning with interactive games.
- Transportation and Logistics: real-time traffic prediction and route optimization, smart parking.
- Warehouse management: inventory management and optimization, object recognition.
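To make the greenhouse use case concrete, here is a minimal sketch of a multimodal (image plus text) request with the Kotlin client SDK; the identifyPlant helper, the photo path, the prompt wording and the BuildConfig key are all illustrative assumptions rather than part of any official sample.

```kotlin
import android.graphics.BitmapFactory
import com.google.ai.client.generativeai.GenerativeModel
import com.google.ai.client.generativeai.type.content

// Hypothetical plant-identification helper for the greenhouse use case.
suspend fun identifyPlant(photoPath: String): String? {
    val model = GenerativeModel(
        modelName = "gemini-1.5-flash",         // speed-optimized variant
        apiKey = BuildConfig.GEMINI_API_KEY     // assumed Gradle-injected key
    )
    val photo = requireNotNull(BitmapFactory.decodeFile(photoPath)) {
        "Could not decode image at $photoPath"
    }

    // The content builder lets an image and a text instruction share one prompt.
    val prompt = content {
        image(photo)
        text("Identify this plant and suggest a short care guide.")
    }
    return model.generateContent(prompt).text
}
```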
Gemini Integration
The latest AI models are exposed via the Gemini API, which is available on all platforms. The official cookbook can be explored here.
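As a starting point for the API route, a text-only call with the Kotlin client SDK looks roughly like the sketch below; the Gradle coordinate in the comment and the BuildConfig key are assumptions based on the standard Android setup, so check the cookbook for the current details.

```kotlin
import com.google.ai.client.generativeai.GenerativeModel

// Gradle dependency (coordinate assumed from the Android quickstart):
// implementation("com.google.ai.client.generativeai:generativeai:<latest-version>")

suspend fun askGemini(prompt: String): String? {
    val model = GenerativeModel(
        modelName = "gemini-1.5-flash",       // pick the variant that fits your use case
        apiKey = BuildConfig.GEMINI_API_KEY   // assumed: key injected at build time
    )
    // generateContent is a suspend function, so call it from a coroutine.
    return model.generateContent(prompt).text
}
```

Since generateContent suspends, it is typically launched from a coroutine scope such as viewModelScope, which keeps the network call off the main thread.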
In addition to the API implementation, you can already start using Gemini as your default assistant on iOS & Android.
Pricing models
Usage and pricing are token-based. Below are a few examples of how tokens are counted, followed by a small token-counting sketch; the official documentation is available here:
- Number of tokens used per word – a token roughly corresponds to a short word piece (about four characters on average): common words are usually a single token, longer words may be split into several, and each punctuation mark counts as its own token. For the input “Hello, how are you?”, the observed count is 7 tokens (2 for Hello; 1 for the comma; 1 each for how, are and you; 1 for the question mark). The model’s response is also counted towards the total number of tokens, using the same rules.
- Number of tokens used per image – approximately 260 tokens for a regular-sized image (1920x1200 px, 312 KB). The count appears to be static across images, since it does not fluctuate with image size or resolution.
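Rather than estimating by hand, the count can be checked programmatically; the sketch below uses the Kotlin client's countTokens call on the same example sentence (the setup mirrors the earlier snippets, and the BuildConfig key is again an assumption).

```kotlin
import com.google.ai.client.generativeai.GenerativeModel

// Returns the token count of a prompt without generating a response.
suspend fun measurePrompt(prompt: String): Int {
    val model = GenerativeModel(
        modelName = "gemini-1.5-flash",
        apiKey = BuildConfig.GEMINI_API_KEY   // assumed build-time injected key
    )
    return model.countTokens(prompt).totalTokens
}

// Example: measurePrompt("Hello, how are you?") returns the prompt's token count.
```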
For testing purposes, Google AI Studio, the official web-based tool, can be used for free.
Conclusion
Integrating Gemini into web and mobile applications is quick and straightforward, thanks to the SDK support for multiple platforms. It’s crucial to first identify the areas in your app that can benefit most from Gemini integration to maximize its impact. Choosing the right model depends heavily on your specific use case, ensuring that the solution fits your needs perfectly. Additionally, the pricing plan is determined by token usage, allowing for flexible and scalable cost management.