Computing and Information Systems - Theses

    High-quality lossless web page template and data separation
    Zhao, Chenxu (2018)
    Web page separation is an important task that aims to separate a web page into template code and the data records populated into the template. Web page separation must be lossless: the web page can be reconstructed by running the template code on the data records. In this thesis, we investigate two sub-problems of web page separation: obtaining (1) high-quality template code and (2) high-quality data records.

    For the first sub-problem, we focus on improving the maintainability of the template code. Easily maintainable template code is reliable and simplifies further development on top of the template code, e.g., updating the web templates. We formulate this problem, analyze its complexity, and show that it is NP-hard. We then propose a heuristic algorithm to solve it. The main idea of our algorithm is to parse a web page into a tree and then process the tree recursively in a bottom-up manner with three steps: splitting, folding, and alignment. In particular, we split the siblings in the tree and fold them into chunks, and the alignment step aligns the siblings in the same chunk. During the sibling splitting step, to determine which siblings should be grouped into the same chunk, we further propose a population-based optimization algorithm named dual teaching and learning based optimization. We perform experiments on real data sets to evaluate the performance of our proposed algorithms in maximizing the maintainability of the template code produced. Experimental results show that our proposed algorithms outperform the baseline algorithms in the maintainability measure.

    For the second sub-problem, we focus on extracting data records from a set of web pages that are generated by different unknown templates and on deducing the schemas that provide the data records. The extracted data records can be used in many applications, such as stock market prediction and personalized recommendation systems. We formulate this problem and propose a framework to tackle it. Our framework processes web pages with four steps: web page template and data separation, template clustering, template alignment, and data record filtering. The web page template and data separation step separates web pages into template code and data records. The template clustering step then clusters the web pages by the similarity of their template code. The template alignment step captures the differences among templates to construct generalized template code that can generate all web pages in the same group. The data record filtering step uses the template code to verify the data records extracted by the web page template and data separation step and corrects those that were extracted incorrectly. We perform experiments on real data sets to evaluate the performance of our framework. Experimental results show that our proposed framework outperforms, in terms of F-score, baseline algorithms that assume a pre-known clustering of the set of web pages.
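
To make the lossless requirement concrete, the following is a minimal toy sketch in Python. It is not the algorithm from the thesis: it folds a list of sibling markup strings into a shared template (a common prefix and suffix) plus one data record per sibling, and then checks that unfolding reproduces the siblings exactly. The function names `fold_siblings` and `unfold` and the sample markup are assumptions made for illustration only.

```python
import os


def fold_siblings(siblings):
    """Fold sibling strings into (prefix, suffix, records).

    Toy stand-in for a folding step: the shared prefix and suffix play
    the role of template code, and the differing middle part of each
    sibling is its data record.
    """
    # os.path.commonprefix compares character by character, so it works
    # on arbitrary strings, not only on file paths.
    prefix = os.path.commonprefix(siblings)
    # Find the common suffix of what remains by reversing the remainders.
    reversed_rest = [s[len(prefix):][::-1] for s in siblings]
    suffix = os.path.commonprefix(reversed_rest)[::-1]
    records = [s[len(prefix):len(s) - len(suffix)] for s in siblings]
    return prefix, suffix, records


def unfold(prefix, suffix, records):
    """Run the toy template on the records to rebuild the siblings."""
    return [prefix + record + suffix for record in records]


# Hypothetical sample input: three list items generated by one template.
siblings = ["<li>apple</li>", "<li>pear</li>", "<li>plum</li>"]
prefix, suffix, records = fold_siblings(siblings)

# Lossless check: the template plus the records reproduce the input.
assert unfold(prefix, suffix, records) == siblings
```

For this sample, the template parts come out as `<li>` and `</li>` and the records as `apple`, `pear`, and `plum`. A real separator would operate on a parsed tree rather than on raw strings, and would also have to decide which siblings belong in the same chunk, which is the part the thesis addresses with its splitting and alignment steps.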