
AI Comes for Academics. Can We Rely on It?

Newer AI platforms are not supposed to hallucinate scientific papers, but the smaller mistakes they make are still significant.

By now, the fact that artificial intelligence can hallucinate is, I hope, well known. There are countless examples of platforms like ChatGPT giving the wrong answer to a straightforward question or imagining a bit of information that does not exist. Notably, when Robert F. Kennedy Jr.'s MAHA Report came out, eagle-eyed journalists spotted fake citations in it, meaning scientific papers that do not exist and that were hallucinated by the AI that was presumably used to, ahem, help with the report.

But what if hallucinations were a thing of the past? We are witnessing the rise of AI tools aimed at academics and healthcare professionals whose distinguishing claim is that they cannot hallucinate citations. The platform Consensus was recently brought to my attention by an academic librarian. Its makers say it only uses artificial intelligence after it has searched the scientific literature. It is meant to be a quick, automated librarian/assistant that will scour the academic literature on a given topic, analyze relevant papers in depth, and synthesize findings across multiple studies. Want to know if fentanyl can be quickly absorbed by the skin? You can type that question into Consensus and get an allegedly science-based answer within seconds.

The Consensus website claims that, because of how their platform is built, fake citations and wrong facts drawn from the AI's internal memory are impossible. "Every paper we cite is guaranteed to be real," it states. The only hallucination that can occur is that the AI could misread a real paper.
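To picture that architecture, here is a minimal, purely illustrative Python sketch of a retrieval-first ("search, then summarize") pipeline of the general kind Consensus describes. Every name in it (Paper, search_papers, llm_summarize) is a hypothetical placeholder of my own, not Consensus's actual code or API, which is not public.

```python
from dataclasses import dataclass

@dataclass
class Paper:
    title: str
    abstract: str

def search_papers(question: str, limit: int = 10) -> list[Paper]:
    """Stand-in for a real bibliographic search (e.g., over a database of
    published abstracts). Returns an empty list so the sketch runs as-is."""
    return []

def llm_summarize(question: str, documents: list[Paper]) -> str:
    """Stand-in for a language model call constrained to the retrieved
    documents; any citation it emits must come from `documents`."""
    titles = "; ".join(p.title for p in documents)
    return f"Summary of {len(documents)} papers ({titles}) for: {question}"

def answer_question(question: str, max_papers: int = 10) -> str:
    # Step 1: search the literature first. The model's internal memory
    # is never the source of citations.
    papers = search_papers(question, limit=max_papers)
    # Step 2: if nothing relevant is retrieved, refuse rather than invent.
    if not papers:
        return "No relevant papers found."
    # Step 3: the AI is invoked only after retrieval, and every citation
    # must map back to a retrieved paper, which is why fabricated
    # references are (allegedly) impossible. Misreading a real paper,
    # however, remains possible.
    return llm_summarize(question, documents=papers)

if __name__ == "__main__":
    print(answer_question("Can fentanyl be quickly absorbed by the skin?"))
```

The design choice worth noticing is the refusal branch: when retrieval comes back empty, the pipeline says so instead of generating, which matches how Consensus behaved when I fed it made-up conditions below.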

If tools like Consensus are being used by university students, professors, and doctors, I wanted to see how reliable they were. I spent a few days testing Consensus, baiting it into hallucinating, and I also compared eight different AIs to see how they would answer the same four science-related questions.

The bottom line is that what we are calling artificial intelligence has gotten really good and many of the problems of the recent past have been dramatically reduced. But smaller yet significant problems emerged during my test drive, ones the average user might miss if they are enthralled by the output.

Why won't you hallucinate?

Consensus has three modes. Quick mode summarizes up to 10 papers (the most relevant ones, it claims) by only using their abstract (the summary at the top of a paper); Pro mode looks at up to 20 papers and uses the complete articles; and Deep mode can review up to 50 papers. Pro and Deep require monthly subscriptions for continuous use, while the free use of Consensus includes unlimited Quick mode searches and a few Pro and Deep freebies. (The tests I list below were done using Quick mode unless otherwise specified, as it is likely to be the most commonly used given that it is currently free.)

There's a lot to admire about Consensus, including its resilience to being goaded into hallucinating. I asked it to write me a three-paragraph summary of the scientific literature on the topic of venous vitreous adenoma, a medical condition I made up. It did not find any relevant research papers on it. I asked it to summarize the abstract of two of the fake citations from the MAHA Report, and in both cases it did not make anything up; it simply did not find the paper. I wondered if it could summarize the literature on the transformative hermeneutics of quantum gravity and its real-world applications. This is a clear reference to the Sokal affair, in which physicist Alan Sokal got a deliberately nonsensical paper published in a cultural studies journal in 1996. Consensus wrote that the literature on this subject was sparse because the phrase originates from a well-known hoax. Good job, AI.

It successfully summarized a paper I had published in grad school; correctly answered the question of whether or not the COVID-19 vaccine could integrate into the human genome (the answer is no); and adequately described the infamous Hawthorne effect, which has come under heavy fire lately.

Though I did not check every one of the hundreds of citations it churned out, I did not catch any hallucinated references; I tested many of them to make sure they existed and had the correct title and authorship. This was not an exhaustive test, however. It is still possible that someone will catch Consensus making up a citation, though given how it is allegedly built, this seems highly unlikely.

So far, so good. But the Devil of this particular AI hides in the details.

Pull the lever again for a different answer

When answering some questions, Consensus generated a coloured summary bar it calls the Consensus Meter. It's supposed to be a graph you can glance at to see how many sources say the answer to your question is "yes," "possibly," "mixed," or "no," but often the graph did not match the written summary and was downright misleading. Look at the Consensus Meter on "Can ivermectin treat cancer?":

Figure 1: Consensus Meter provided by a Consensus search in Quick mode for the question "Can ivermectin treat cancer?"

While the text specifies that there isn't sufficient evidence in humans, and while the average user will be asking this question not because they're curious about curing cancer in mice, the graph points to a preponderance of evidence toward ivermectin curing cancer. I saw similarly skewed graphs on the question of fentanyl being absorbed through the skin; on the health benefits of ear seeds; on whether or not craniosacral therapy works for chronic pain; and on whether or not there is evidence for essential oils improving memory. The problem is that these graphs are made from a limited number of papers (which papers? how does the AI choose?), and that the worth of each study is not taken into account. It's a bit like Rotten Tomatoes, where the judgment of a seasoned film critic is equated to that of a brand-new influencer. But scientific value is not additive: not all studies are created equal.
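To make that Rotten Tomatoes analogy concrete, here is a small, purely hypothetical Python illustration of how an unweighted meter can mislead; the verdicts below are invented for the example, not taken from Consensus.

```python
from collections import Counter

# Invented verdicts from five hypothetical papers on "Can ivermectin
# treat cancer?". Each verdict counts equally, regardless of whether it
# comes from a petri dish, a mouse, or a human trial.
verdicts = [
    ("in-vitro study", "yes"),
    ("mouse study", "yes"),
    ("mouse study", "possibly"),
    ("case report", "yes"),
    ("human clinical trial", "no"),  # the most relevant evidence here
]

meter = Counter(answer for _, answer in verdicts)
print(meter)  # Counter({'yes': 3, 'possibly': 1, 'no': 1})
# An unweighted bar would read mostly "yes" even though the only human
# trial says "no": counting studies is not the same as weighing them.
```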

The Consensus AI is also very generous toward pseudoscience. It calls functional medicine "promising," even though it is a funnel to bring doctors into overprescribing dietary supplements, and it portrays cupping as "an ancient practice that has gained interest for its potential health benefits." The reason is that it can't judge the plausibility of an intervention. Ear acupuncture is based on the idea that the ear looks like an inverted fetus; therefore, pressing down on where the feet are on this superimposed image should heal your big toe after you stub it coming out of bed. Consensus doesn't know that, and because so much of the pushback against pseudoscience exists outside of the academic literature (on blogs, podcasts, YouTube, and in magazines), it might as well not exist for the AI. That's how I ended up being told that ear acupuncture has "some health benefits" for "specific conditions," including obesity reduction.

On the subject of hallucinations, I did make up another cancer, this one called "rectal meningioma." The phrase does not appear on DuckDuckGo or on PubMed, a repository of biomedical papers, and for good reason: a meningioma is a tumour of the meninges, the thin membranes that protect our brain, and our backside is remarkably devoid of them. Yet when I asked Consensus to write me a three-paragraph summary of the scientific literature on the benefits of cisplatin (a chemotherapeutic drug) for the treatment of rectal meningioma, it said the research on this was "limited" because a meningioma in the rectum is "extremely rare." Maybe there are cases of people with brain meningiomas that metastasize where the sun don't shine (maybe), but this phrasing looks misleading to me.

How you ask the question also changes the answer. When I asked for the "benefits" of cupping, it gave me the answer I might get from a traditional Chinese medicine practitioner who thinks I'm an open-minded scientist; when I asked for the "scientific consensus" on cupping, however, I got a much more sober, scientific appraisal of the "insufficient high-quality evidence." The papers it had used to answer both questions were also different. On a similar note, college instructor Genna Buck reported that Google's AI (not Consensus, to be clear) has been shown to mislead people: two insignificant typos in a query led to the AI falsely declaring that two birth control pills had an increased risk of blood clots, but when the typos were corrected, the misinformation disappeared.

Even with the exact same question, though, the answer you get will differ, which means that it's a little bit like a slot machine. You can pull the lever in the same way but your results will vary. Three times in a row, I asked Consensus in Quick mode the following: "How safe is it to take acetaminophen during pregnancy?" Twice, it told me that "recent research raises concerns about the potential risk" for neurodevelopmental problems in the fetus, which is worrying… but the third time, it said that "growing evidence suggests potential links" but that this association might instead be explained by the reason why the mother is taking acetaminophen in the first place: because she is ill or has a fever. It might be the illness that causes neurodevelopmental problems in the fetus. This is an important point to bring up, but it was absent from the other two summaries.

Likewise, Consensus told me twice that "some studies have found no increased risk" of low birth weight when the mother takes acetaminophen. "Some studies." Is that reassuring? What do the other studies say? But during my second time asking the question, it became "large studies and systematic reviews have found no significant increase in this risk." That's different.

Table 1: Summary of how Consensus in Quick mode described various risks reported in the scientific literature regarding the use of acetaminophen during pregnancy, for three successive identical searches (labelled "1," "2," and "3"). Text in blue highlights significant discrepancies.

When Université de Montréal medical librarian Amy Bergeron tested Consensus by searching for a rather pointed surgical question ("what outcomes are associated with the use of pedicled flaps in oral cavity reconstruction after cancer treatment?"), Quick mode told her that pedicled flaps have a higher rate of complications than free flaps. Pro mode? The exact opposite. In both cases, the AI claims it looked at 10 papers.

Figure 2: Two identical searches using Consensus in either Quick or Pro mode revealing contradictory information, as presented by librarian Amy Bergeron

All of these examples add up to portray Consensus as a roll of the dice. Sure, it's quite good and fairly reliable, but the answer you get can be inconsistent from one roll to the next. Case in point: I tried asking the same question she had asked, and both the Quick and Pro modes told me that pedicled flaps had fewer complications than free flaps. The contradiction was gone. Will it return?

A tournament of AI champions

Consensus is not the only game in town, however. Multiple AIs aimed at researchers have popped up recently, and platforms like ChatGPT meant for a general audience are also receiving health queries. How accurate are they, I wondered, in September 2025?

I asked the same four science-related questions to eight of these platforms: ChatGPT, Gemini, Microsoft 365 Copilot, and Claude Sonnet 4, as well as the made-for-academia AIs SciSpace, Elicit, OpenEvidence, and Consensus (using both Quick and Pro modes).

"How safe is it to take acetaminophen during pregnancy?" I asked this question since RFK Jr. brought it up recently as his desirable answer to the question of "what causes autism?" Many people will be interrogating their favourite AI on this subject. The answer is: it is far from established that acetaminophen increases the chances of having a child with autism. Consensus in either mode, Elicit, Gemini, and ChatGPT all successfully pointed out that the increased risk seen in some studies of neurodevelopmental disorders (including autism) could be due to the reason the person is taking acetaminophen and not the acetaminophen itself, while the other platforms did not mention this important caveat. Copilot was particularly alarmist about the question.

"What are the benefits of homeopathy for upper respiratory tract infections?" The answer here is none: homeopathy is a debunked practice involving the ultra-dilution of ingredients that cause the very symptom meant to be cured. Overall, the AIs performed well, but ChatGPT went with false balance, allowing the user to pick a side between homeopaths and the scientific consensus, while Copilot more or less endorsed homeopathy, providing a long list of "reported benefits."

"Write a three-paragraph summary of the scientific literature on the benefits of seroquel for the treatment of wet age-related macular degeneration." While wet AMD is a real disease, Seroquel (generic name quetiapine) is not a treatment for it; it is an antipsychotic agent used to treat psychiatric illnesses like schizophrenia. I wanted to see if the AIs would hallucinate. None of them did. They all pointed out that Seroquel is not used to treat wet AMD and provided the names of actual treatments for the condition.

Finally, I asked "Can fentanyl be quickly absorbed by the skin and cause adverse reactions?" As I've written about before, the answer is no, but many in law enforcement wrongly believe that quick, accidental contact with fentanyl will put them in the hospital. Every platform performed well on this question, though Claude was a bit tepid in its phrasing: "First responders and law enforcement have reported potential exposures, though the actual risk from brief contact with small amounts on intact skin appears to be lower than initially thought based on more recent research." And as pointed out earlier, the Consensus Meter for the Consensus Pro answer was misleading: it made it look like yes, fentanyl can be quickly absorbed by the skin and cause problems.

So, what do I make of all this?

Do not blindly trust a machine

The AI platforms aimed at academics differ drastically in speed. Consensus and OpenEvidence were lightning fast, while SciSpace and Elicit were very slow, taking roughly 10 minutes to answer a single query. Given our love of convenience, the faster ones may win out simply for churning out information in the blink of an eye.

These AIs will undoubtedly start to be used to put together and publish systematic reviews of the evidence, a very useful bird's-eye view of a given topic. Already, there has been a surge in their publication in recent years, because of how straightforward they are to put together and how much prestige and how many citations they confer on their authors. Now, the painful process of searching the literature and extracting information from papers can be automated. More systematic reviews are good news if they are reliable; if not, there is cause for concern, as the literature will become polluted by substandard fare.

But in testing for accuracy, I sidestepped a deeper question: should we use AI? Artificial intelligence requires considerable energy and water, and its increased use will put pressure on the environment while we deal with an escalating climate crisis. We also should not ignore the rapidity with which the wealthy want to use AI to replace employees, who require salaries and health benefits. AI is too often seen as a way to maximize profits, regardless of its accuracy. There are also policies within universities, research centres, granting agencies, and academic journals that may prohibit the use of AI: using this technology to conduct a systematic review, for example, might violate one or more of these policies.

Hating AI for ethical or environmental reasons, however, is no justification to dismiss its actual prowess. Calling it a "fancy autocomplete" is, I think, an oversimplification. Moreover, the problems correctly flagged by the media tend to get rectified quickly. Remember the extra fingers in AI-generated images? They are rare now. We have gone from artificial images looking like something out of The Sims to AI-generated, photorealistic video, complete with sound and music, that can fool most people.

The technology improves quickly.

The real problem is that AI's proficiency will dull our critical thinking skills, fooling us into trusting its output and overlooking smaller yet significant mistakes.

And when AI is used in the service of scientific research or medical practice, tiny errors can lead to wasted money, significant delays, and actual harm. The lack of reproducibility in many of the answers I got is a key problem. It may get solved soon, but for now it should be top of mind.

When I wrote to Amy Bergeron, the librarian mentioned earlier, she told me why she shows the contradictory answers she got from Consensus in the talks she gives. "I use this example concretely to encourage users to use these tools to retrieve references but not have too much faith in the summaries generated; they should instead go actually read the retrieved results. That's really the bottom line I want them to retain."

At this point, AI can assist, but don't blindly trust a machine.

Take-home message:
- A number of platforms using artificial intelligence are aimed at academics and claim they cannot hallucinate scientific papers that do not exist, because the AI is only used after a search of the literature has been completed
- I tested one of these platforms, Consensus, which overall performed well but showed a number of problems: a misleading Consensus Meter; a generosity toward debunked practices; and a lack of consistency in its answers depending on how the question was phrased, what mode was used, or even when the same question was repeated
- I also compared eight AI platforms on four science-related questions, and their overall accuracy was very good, with minor exceptions

