My Impressions of Be My AI

I’ve done more testing of multimodal GPT-4 (the model powering Be My AI) in the last few days, most of it with sighted friends, and my impressions are as follows.

The thing is pretty accurate when describing memes, but the descriptions are often far too long and verbose, and the facts are presented in an order that makes the meme less funny than it should be. There’s a fair bit of Christian-fundamentalist prudery being applied: people who aren’t wearing any clothes are described as “cropped.” The same goes for faces, which are “blurred out for privacy” even when they actually aren’t. This would be a sensible privacy precaution if not for the fact that the blurring also applies to the faces of famous people, making many images meaningless. On the plus side, the model can sometimes notice details that sighted people don’t until they’re pointed out to them.

However, it’s pretty bad when it comes to actually useful stuff such as diagrams and figures. To give just one example, we gave it a run-of-the-mill diagram of a chessboard, and it described the chess positions in vivid detail while being absolutely wrong about what these positions actually were, as these AI models tend to do.

It’s even worse with text, especially foreign-language text. Unlike many OCR algorithms, which produce text containing many typos, Be My AI’s output is almost always free of those. It makes grammatical sense and is contextually related to what the image actually contains, but the text that it claims is in the image actually isn’t there. For example, when we gave it a page of a coffee machine manual in Polish, with pictures and descriptions of the various kinds of coffee that the machine can make, it got the coffee names right and the coffee descriptions were pretty accurate, but they were completely different descriptions from those on the actual page!

It’s also pretty clear that the tokenizer Be My AI uses was primarily trained on English text. This causes foreign-language output to need more tokens for the same number of characters, which lengthens generation time and, more crucially, often causes the text to be cut off prematurely, presumably because it hits the output length limit sooner.

In conclusion, I stand by my opinion that this tool holds great promise for the future, but in its current incarnation it has very limited use for a blind person and is barely more than a fun but only occasionally useful toy.
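
As a footnote to the tokenizer point above, here is a rough Python sketch of why the same length budget buys less non-English text. It assumes a tokenizer that starts from UTF-8 bytes, and it uses the raw byte count as a crude stand-in for the token count. That is my simplification, not how Be My AI actually counts tokens, and the helper names and sample strings are made up for illustration.

```python
# Rough illustration only: UTF-8 byte count is a crude proxy for token count.
# A real tokenizer merges bytes into larger tokens, and the merges it learns
# favour whichever language dominated its training text.

def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character (hypothetical helper)."""
    return len(text.encode("utf-8")) / len(text)


def chars_within_budget(text: str, byte_budget: int) -> int:
    """Count how many leading characters fit into byte_budget UTF-8 bytes."""
    used = 0
    for count, ch in enumerate(text):
        used += len(ch.encode("utf-8"))
        if used > byte_budget:
            return count  # this character would overflow the budget
    return len(text)


english = "coffee with milk"
polish = "Zażółć gęślą jaźń"  # Polish phrase packed with diacritics

for label, sample in (("English", english), ("Polish", polish)):
    print(label,
          "bytes/char:", round(utf8_bytes_per_char(sample), 2),
          "chars in a 10-byte budget:", chars_within_budget(sample, 10))
```

With a fixed budget, the Polish sample runs out of room after fewer characters than the English one, which is the same effect as the premature cut-off described above, just at a much smaller scale.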