Estimating the Internet malicious host population while preserving privacy
Affiliation: Department of Computing and Information Systems, Faculty of Engineering
Document Type: PhD thesis
Citations: Wahid, A. (2013). Estimating the Internet malicious host population while preserving privacy. PhD thesis, Department of Computing and Information Systems, The University of Melbourne.
Access Status: Open Access
© 2013 Dr. Alif Wahid
The Internet is a globally significant infrastructure that attracts a large number of threats posed by the population of malicious hosts within it. These threats scale with the size of the malicious host population, which makes the accurate estimation of this population an important challenge. The difficulty of this challenge is further compounded by the conflicting requirements of preserving the privacy of bystanders associated with malicious host behaviour while accurately identifying malicious host instances across the Internet. In this thesis, we address this challenge of estimating the Internet malicious host population while preserving privacy. We begin by identifying four major research problems that have not been addressed in the literature. First is the lack of a model for host-to-address bindings. Second is the characterisation of malicious address properties. Third is the correlation of independent measurements. And fourth is the development of dynamic countermeasures. We subsequently proceed to develop novel solutions corresponding to the first three problems, while the fourth remains to be addressed in the future. Our first contribution is the development of a probabilistic model for host-to-address bindings, which allows the number of hosts that attached to an observed address to be inferred based on privacy preserving data sets and a publicly accessible ground truth. We demonstrate the properties of this model in terms of preferential attachment and point out its primary benefit in terms of enabling the inference of host behaviour based only on address characteristics, which is a necessary condition for privacy preservation. However, this leads to the need for an understanding of various address characteristics in order to draw reliable and robust inferences. Our second contribution is the analysis of a large repository of intrusion alerts from globally distributed vantage points that provide access to various characteristics of malicious addresses. 
We find that alerted addresses are active for very short periods in the order of a few minutes and that they rarely appear more than once. We also find that there are statistically self-similar properties corresponding to these addresses in terms of non-existent temporal and spatial clusters. The main implication is that intrusion alerts contain the necessary information for use with our model of host-to-address bindings but lack sufficient robustness for reliably estimating the number of malicious hosts corresponding to an address due to the presence of spoofed and inactive sources. Our third contribution is the combined analysis of passive measurements in the form of intrusion alerts with active measurements in the form of ping responses in order to identify those addresses that are active, attached, allocated and malicious simultaneously across two different data sets gathered independently. Our guiding hypothesis is that intrusion alerts and ping responses are different behavioural aspects of the same underlying malicious hosts. Subsequently, we apply our probabilistic model of host-to-address bindings to this intersected data set and find that more than 80% of observable addresses bind to multiple hosts, and that the distribution of malicious hosts across the IPv4 address space is highly non-uniform. This has major implications for the widespread use of blacklisting to counter the threat posed by malicious hosts, since the information used to blacklist various sources usually expires quickly. The aforementioned overall contributions of this thesis collectively form a methodology for estimating the number of malicious hosts corresponding to an observed address while preserving privacy. We demonstrate that we can achieve a reasonable accuracy of estimation while maintaining the privacy of all associated users. Our work is also based on openly accessible data sets and ground truth, which enables reproducibility of our results.
We also demonstrate that this is broadly applicable within the Internet infrastructure that exists today.
Keywords: intrusion detection; privacy protection; network security