StreetMath: Study of LLMs’ Approximation Behaviors

1 Luxmuse AI
2 Department of Electrical and Computer Engineering, North Carolina State University
3 Department of Mathematics, University of Waterloo
4 ByteDance Inc.
5 Department of Computer Science, Saint Louis University
Preprint 2025

Abstract

There is a substantial body of literature examining the mathematical reasoning capabilities of large language models (LLMs), particularly their performance on exact arithmetic in autoregressive architectures. Far less attention has been paid to their ability to perform approximate reasoning of the kind used in informal, fast-paced mental arithmetic, especially in non-autoregressive decoder models. Our work addresses this gap by introducing StreetMath, a benchmark designed to evaluate models’ approximation abilities in realistic, everyday scenarios. We conduct extensive evaluations across different LLM architectures: Qwen3-4B-Instruct-2507, Qwen3-4B-Thinking-2507, Dream-v0-Instruct-7B, Falcon-Mamba-7B-Instruct, and Mamba-GPT-3B. Furthermore, we apply mechanistic interpretability techniques to probe their internal computational states. Our analysis reveals that LLMs generally attempt to compute exact values, or invoke external tools, even in tasks that call for approximation. Moreover, while models sometimes reach the correct answer in early layers or steps, they still consume more tokens when solving approximation tasks. Additional experiments indicate that exact and approximate arithmetic rely on largely separate neural components. Drawing upon research in cognitive psychology, we argue that LLMs do not exhibit cognitive miserliness in the same way humans do in street math settings. We open-source our work at https://github.com/ctseng777/StreetMath
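
To make the layer-wise probing concrete, below is a minimal logit-lens sketch, a standard mechanistic interpretability technique, applied to one of the evaluated models. The prompt, the choice of reading the top-1 token at the last sequence position, and the readout through the final norm and unembedding are illustrative assumptions, not the paper's exact protocol.

# Minimal logit-lens sketch (illustrative assumptions noted above;
# not the paper's exact probing protocol).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B-Instruct-2507"  # one of the evaluated models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Roughly, what is 487 + 512? Answer with one number:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Project each layer's last-position hidden state through the final RMSNorm
# and the unembedding matrix, then print that layer's most likely next token.
final_norm = model.model.norm
unembed = model.lm_head
for layer_idx, h in enumerate(out.hidden_states):
    logits = unembed(final_norm(h[0, -1]))
    top_token = tok.decode([logits.argmax().item()])
    print(f"layer {layer_idx:2d}: {top_token!r}")

If an early layer's readout already matches a sensible rounded answer (e.g. "1000"), the approximate result is linearly decodable well before the final layer, which is the kind of evidence behind the observation that models sometimes reach the correct answer in early layers.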

BibTeX

@article{StreetMath2025,
  title={StreetMath: Study of LLMs' Approximation Behaviors},
  author={Chiung-Yi Tseng and Somshubhra Roy and Maisha Thasin and Danyang Zhang and Blessing Effiong},
  year={2025},
  url={https://github.com/ctseng777/StreetMath}
}

References

Surveys and Benchmarks

  • Ahn, J., Verma, R., Lou, R., Liu, D., Zhang, R., Yin, W. (2024). Large language models for mathematical reasoning: Progresses and challenges. arXiv:2402.00157.
  • Lewkowycz, A., Andreassen, A., Dohan, D., et al. (2022). Solving quantitative reasoning problems with language models. NeurIPS 35:3843–3857.
  • Shao, Z., Wang, P., Zhu, Q., et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.
  • Srivastava, G., Hussain, A., Srinivasan, S., Wang, X. (2025). LLMThinkBench: Towards basic math reasoning and overthinking in LLMs. arXiv:2507.04023.
  • Paster, K., Dos Santos, M., Azerbayev, Z., Ba, J. (2023). OpenWebMath: An open dataset of high-quality mathematical web text. arXiv:2310.06786.
  • Mirzadeh, I., Alizadeh, K., Shahrokhi, H., et al. (2024). GSM‑Symbolic: Understanding the limitations of mathematical reasoning in LLMs. arXiv:2410.05229.

Arithmetic and Number Representations

  • Gambardella, A., Iwasawa, Y., Matsuo, Y. (2024). Language models do hard arithmetic tasks easily and hardly do easy arithmetic tasks. arXiv:2406.02356.
  • Levy, A. A., Geva, M. (2025). Language models encode numbers using digit representations in base 10. NAACL 2025 (Short Papers).
  • Zhou, T., Fu, D., Sharan, V., Jia, R. (2024). Pre‑trained LLMs use Fourier features to compute addition. arXiv:2406.03445.
  • Kantamneni, S., Tegmark, M. (2025). Language models use trigonometry to do addition. arXiv:2502.00873.
  • Lauter, K. et al. (2024). Machine learning for modular arithmetic. arXiv preprint.
  • Nikankin, Y., Reusch, A., Mueller, A., Belinkov, Y. (2025). Arithmetic without algorithms: Language models solve math with a bag of heuristics. arXiv:2410.21272.
  • Zhu, W. et al. (2025). Language models encode the concept of numeric magnitude. arXiv preprint.
  • Shah, R. S., et al. (2023). Numeric magnitude comparison effects in large language models. arXiv:2305.02820.

Mechanistic Interpretability

  • Alain, G., Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644.
  • Hewitt, J., Manning, C. D. (2019). A structural probe for finding syntax in word representations. NAACL 2019:4129–4138.
  • Yu, Z., Ananiadou, S. (2024). Interpreting arithmetic mechanism in LLMs through comparative neuron analysis. EMNLP 2024:3293–3306.
  • Rai, D., Zhou, Y., Feng, S., Saparov, A., Yao, Z. (2025). A practical review of mechanistic interpretability for transformer‑based LMs. arXiv:2407.02646.
  • Skean, O., Arefin, M. R., Zhao, D., et al. (2025). Layer by layer: Uncovering hidden representations in language models. arXiv:2502.02013.
  • Belinkov, Y., Glass, J. (2019). Analysis methods in neural language processing: A survey. TACL 7:49–72.
  • Christ, B. R., Gottesman, Z., Kropko, J., Hartvigsen, T. (2025). Math neurosurgery: Isolating LMs’ math reasoning abilities using only forward passes. arXiv:2410.16930.

Reasoning and Tool Use

  • Chen, W., Ma, X., Wang, X., Cohen, W. W. (2022). Program of Thoughts prompting. arXiv:2211.12588.
  • Gao, L., Madaan, A., Zhou, S., et al. (2023). PAL: Program‑aided language models. ICML 2023.
  • Schick, T., Dwivedi‑Yu, J., Dessì, R., et al. (2023). Toolformer: Language models can teach themselves to use tools. arXiv:2302.04761.
  • Das, D., Banerjee, D., Manocha, S., Baral, A. (2024). MathSensei: A tool‑augmented LLM for mathematical reasoning. arXiv:2402.17231.
  • Ding, M., Liu, H., Fu, Z., et al. (2024). Break the chain: LLMs can be shortcut reasoners. arXiv:2406.06580.
  • McCoy, R. T., Pavlick, E., Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in NLI. arXiv:1902.01007.
  • Dietz, F., Klakow, D. (2025). IGC: Integrating a gated calculator into an LLM to solve arithmetic tasks reliably and efficiently. arXiv preprint.
  • Saynova, A. et al. (2025). Fact recall, heuristics or pure computation. arXiv preprint.

Architectures and Models

  • Qwen Team (2025). Qwen3 Technical Report. arXiv:2505.09388.
  • Jelassi, S., Brandfonbrener, D., Kakade, S. M., Malach, E. (2024). Transformers are better than state space models at copying. ICML 2024:21502–21521.
  • Li, J. et al. (2025). Diffusion language models. arXiv preprint.
  • Ye, J., Xie, Z., Zheng, L., et al. (2025). Dream‑7B: Diffusion large language models. arXiv:2508.15487.
  • Zuo, J., Velikanov, M., Rhaiem, D. E., et al. (2024). Falcon‑Mamba: The first competitive attention‑free 7B language model. arXiv:2410.05355.
  • CobraMamba (2023). Mamba‑GPT‑3B. Hugging Face model card.

Efficiency and Cognitive Psychology

  • Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
  • Fiske, S. T., Taylor, S. E. (1991). Social Cognition. McGraw‑Hill.
  • Moyer, R. S., Landauer, T. K. (1967). Time required for judgements of numerical inequality. Nature 215(5109):1519–1520.
  • Jiang, D. L., Ye, S., Zhao, L., Gu, B. (2025). Evidence of cognitive miser behavior from a natural experiment. Information Systems Research.
  • Teerapittayanon, S., McDanel, B., Kung, H.‑T. (2016). BranchyNet: Fast inference via early exiting. ICPR 2016:2464–2469.
  • Zhao, W., Guo, J., Deng, Y., et al. (2025). Inherent efficiency within large reasoning models. arXiv:2506.15647.
  • Roy, O., Vetterli, M. (2007). The effective rank: A measure of effective dimensionality. EUSIPCO 2007:606–610.

Calibration and Evaluation

  • Lovering, C., Krumdick, M., Lai, V. D., et al. (2024). Language model probabilities are not calibrated in numeric contexts. arXiv:2410.16007.
  • Goldberg, Y. (2016). A primer on neural network models for NLP. JAIR 57:345–420.
  • Sun, X. et al. (2025). Probing for arithmetic errors in language models. arXiv preprint.