StreetMath: Study of LLMs’ Approximation Behaviors

1 Luxmuse AI
2 Department of Electrical and Computer Engineering, North Carolina State University
3 Department of Mathematics, University of Waterloo
4 ByteDance Inc.
5 Department of Computer Science, Saint Louis University
Preprint 2025

Abstract

There is a substantial body of literature examining the mathematical reasoning capabilities of large language models (LLMs), particularly their performance on exact arithmetic in autoregressive architectures. Far less attention has been paid to their ability to perform approximate reasoning of the kind used in informal, fast-paced mental arithmetic, especially in non-autoregressive decoder models. Our work addresses this gap by introducing StreetMath, a benchmark designed to evaluate models’ approximation abilities in realistic, everyday scenarios. We conduct extensive evaluations across different LLM architectures: Qwen3-4B-Instruct-2507, Qwen3-4B-Thinking-2507, Dream-v0-Instruct-7B, Falcon-Mamba-7B-Instruct, and Mamba-GPT-3B. Furthermore, we apply mechanistic interpretability techniques to probe their internal computational states. Our analysis reveals that LLMs generally attempt to compute exact values, or invoke external tools, even in tasks that call for approximation. Moreover, while models sometimes reach the correct answer in early layers or steps, they still consume more tokens when solving approximation tasks. Additional experiments indicate that exact and approximate arithmetic rely on largely separate neural components. Drawing upon research in cognitive psychology, we argue that LLMs do not exhibit cognitive miserliness in the same way humans do in street math settings. We open-source our work at https://github.com/ctseng777/StreetMath
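
To make the layer-wise probing concrete, below is a minimal logit-lens sketch, a standard mechanistic interpretability technique, applied to one of the evaluated models. The prompt, the choice of reading the top-1 token at the last sequence position, and the readout through the final norm and unembedding are illustrative assumptions, not the paper's exact protocol.

# Minimal logit-lens sketch (illustrative assumptions noted above;
# not the paper's exact probing protocol).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B-Instruct-2507"  # one of the evaluated models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Roughly, what is 487 + 512? Answer with one number:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Project each layer's last-position hidden state through the final RMSNorm
# and the unembedding matrix, then print that layer's most likely next token.
final_norm = model.model.norm
unembed = model.lm_head
for layer_idx, h in enumerate(out.hidden_states):
    logits = unembed(final_norm(h[0, -1]))
    top_token = tok.decode([logits.argmax().item()])
    print(f"layer {layer_idx:2d}: {top_token!r}")

If an early layer's readout already matches a sensible rounded answer (e.g. "1000"), the approximate result is linearly decodable well before the final layer, which is the kind of evidence behind the observation that models sometimes reach the correct answer in early layers.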

BibTeX

@article{StreetMath2025,
  title={StreetMath: Study of LLMs' Approximation Behaviors},
  author={Chiung-Yi Tseng and Somshubhra Roy and Maisha Thasin and Danyang Zhang and Blessing Effiong},
  year={2025},
  url={https://github.com/ctseng777/StreetMath}
}

References

Surveys and Benchmarks

  • Ahn, J., Verma, R., Lou, R., Liu, D., Zhang, R., Yin, W. (2024). Large language models for mathematical reasoning: Progresses and challenges. arXiv:2402.00157.
  • Lewkowycz, A., Andreassen, A., Dohan, D., et al. (2022). Solving quantitative reasoning problems with language models. NeurIPS 35:3843–3857.
  • Shao, Z., Wang, P., Zhu, Q., et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.
  • Srivastava, G., Hussain, A., Srinivasan, S., Wang, X. (2025). LLMThinkBench: Towards basic math reasoning and overthinking in LLMs. arXiv:2507.04023.
  • Paster, K., Dos Santos, M., Azerbayev, Z., Ba, J. (2023). OpenWebMath: An open dataset of high-quality mathematical web text. arXiv:2310.06786.
  • Mirzadeh, I., Alizadeh, K., Shahrokhi, H., et al. (2024). GSM‑Symbolic: Understanding the limitations of mathematical reasoning in LLMs. arXiv:2410.05229.

Arithmetic and Number Representations

  • Gambardella, A., Iwasawa, Y., Matsuo, Y. (2024). Language models do hard arithmetic tasks easily and hardly do easy arithmetic tasks. arXiv:2406.02356.
  • Levy, A. A., Geva, M. (2025). Language models encode numbers using digit representations in base 10. NAACL 2025 (Short Papers).
  • Zhou, T., Fu, D., Sharan, V., Jia, R. (2024). Pre‑trained LLMs use Fourier features to compute addition. arXiv:2406.03445.
  • Kantamneni, S., Tegmark, M. (2025). Language models use trigonometry to do addition. arXiv:2502.00873.
  • Lauter, K. et al. (2024). Machine learning for modular arithmetic. arXiv preprint.
  • Nikankin, Y., Reusch, A., Mueller, A., Belinkov, Y. (2025). Arithmetic without algorithms: Language models solve math with a bag of heuristics. arXiv:2410.21272.
  • Zhu, W. et al. (2025). Language models encode the concept of numeric magnitude. arXiv preprint.
  • Shah, R. S., et al. (2023). Numeric magnitude comparison effects in large language models. arXiv:2305.02820.

Mechanistic Interpretability

  • Alain, G., Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644.
  • Hewitt, J., Manning, C. D. (2019). A structural probe for finding syntax in word representations. NAACL 2019:4129–4138.
  • Yu, Z., Ananiadou, S. (2024). Interpreting arithmetic mechanism in LLMs through comparative neuron analysis. EMNLP 2024:3293–3306.
  • Rai, D., Zhou, Y., Feng, S., Saparov, A., Yao, Z. (2025). A practical review of mechanistic interpretability for transformer‑based LMs. arXiv:2407.02646.
  • Skean, O., Arefin, M. R., Zhao, D., et al. (2025). Layer by layer: Uncovering hidden representations in language models. arXiv:2502.02013.
  • Belinkov, Y., Glass, J. (2019). Analysis methods in neural language processing: A survey. TACL 7:49–72.
  • Christ, B. R., Gottesman, Z., Kropko, J., Hartvigsen, T. (2025). Math neurosurgery: Isolating LMs’ math reasoning abilities using only forward passes. arXiv:2410.16930.

Reasoning and Tool Use

  • Chen, W., Ma, X., Wang, X., Cohen, W. W. (2022). Program of Thoughts prompting. arXiv:2211.12588.
  • Gao, L., Madaan, A., Zhou, S., et al. (2023). PAL: Program‑aided language models. ICML 2023.
  • Schick, T., Dwivedi‑Yu, J., Dessì, R., et al. (2023). Toolformer: Language models can teach themselves to use tools. arXiv:2302.04761.
  • Das, D., Banerjee, D., Manocha, S., Baral, A. (2024). MathSensei: A tool‑augmented LLM for mathematical reasoning. arXiv:2402.17231.
  • Ding, M., Liu, H., Fu, Z., et al. (2024). Break the chain: LLMs can be shortcut reasoners. arXiv:2406.06580.
  • McCoy, R. T., Pavlick, E., Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in NLI. arXiv:1902.01007.
  • Dietz, F., Klakow, D. (2025). IGC: Integrating a gated calculator into an LLM to solve arithmetic tasks reliably and efficiently. arXiv preprint.
  • Saynova, A. et al. (2025). Fact recall, heuristics or pure computation. arXiv preprint.

Architectures and Models

  • Qwen Team (2025). Qwen3 Technical Report. arXiv:2505.09388.
  • Jelassi, S., Brandfonbrener, D., Kakade, S. M., Malach, E. (2024). Transformers are better than state space models at copying. ICML 2024:21502–21521.
  • Li, J. et al. (2025). Diffusion language models. arXiv preprint.
  • Ye, J., Xie, Z., Zheng, L., et al. (2025). Dream‑7B: Diffusion large language models. arXiv:2508.15487.
  • Zuo, J., Velikanov, M., Rhaiem, D. E., et al. (2024). Falcon‑Mamba: The first competitive attention‑free 7B language model. arXiv:2410.05355.
  • CobraMamba (2023). Mamba‑GPT‑3B. Hugging Face model card.

Efficiency and Cognitive Psychology

  • Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
  • Fiske, S. T., Taylor, S. E. (1991). Social Cognition. McGraw‑Hill.
  • Moyer, R. S., Landauer, T. K. (1967). Time required for judgements of numerical inequality. Nature 215(5109):1519–1520.
  • Jiang, D. L., Ye, S., Zhao, L., Gu, B. (2025). Evidence of cognitive miser behavior from a natural experiment. Information Systems Research.
  • Teerapittayanon, S., McDanel, B., Kung, H.‑T. (2016). BranchyNet: Fast inference via early exiting. ICPR 2016:2464–2469.
  • Zhao, W., Guo, J., Deng, Y., et al. (2025). Inherent efficiency within large reasoning models. arXiv:2506.15647.
  • Roy, O., Vetterli, M. (2007). The effective rank: A measure of effective dimensionality. EUSIPCO 2007:606–610.

Calibration and Evaluation

  • Lovering, C., Krumdick, M., Lai, V. D., et al. (2024). Language model probabilities are not calibrated in numeric contexts. arXiv:2410.16007.
  • Goldberg, Y. (2016). A primer on neural network models for NLP. JAIR 57:345–420.
  • Sun, X. et al. (2025). Probing for arithmetic errors in language models. arXiv preprint.