Enhanced URL Preprocessing for Phishing Detection
This repository contains code for enhanced URL preprocessing techniques to improve phishing detection models.
Overview
The implementation builds upon the existing preprocessing pipeline and adds several advanced features:
Domain-specific features:
- Domain length and structure analysis
- Subdomain detection and analysis
- TLD (Top-Level Domain) analysis
- Common vs. suspicious TLD detection
URL structure analysis:
- Path length and depth analysis
- Query parameter analysis
- Fragment analysis
- IP address detection
- URL shortener detection
Advanced text analysis:
- Character distribution entropy
- Character type ratios
- Brand name detection
- Enhanced suspicious keyword detection
Advanced pattern detection:
- Excessive subdomain detection
- Numeric subdomain detection
- Hyphen analysis
- Repeated character detection
- Homoglyph attack detection
Feature engineering improvements:
- Feature normalization
- Feature importance calculation
- Dimensionality reduction (PCA, t-SNE)
- Visualization tools
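As an illustration of the normalization step, here is a minimal sketch of min-max scaling in plain Python. It is an assumption that the repository scales features this way; `min_max_normalize` is a hypothetical helper, and in practice scikit-learn's `MinMaxScaler` or `StandardScaler` would usually do this column-wise on a DataFrame.

```python
def min_max_normalize(rows):
    """Scale every numeric column of a list of feature dicts to [0, 1].

    Hypothetical helper for illustration; constant columns map to 0.0.
    """
    if not rows:
        return []
    columns = list(rows[0].keys())
    lo = {c: min(r[c] for r in rows) for c in columns}
    hi = {c: max(r[c] for r in rows) for c in columns}
    return [
        {
            c: (r[c] - lo[c]) / (hi[c] - lo[c]) if hi[c] > lo[c] else 0.0
            for c in columns
        }
        for r in rows
    ]


rows = [
    {"domain_length": 5, "entropy": 2.0},
    {"domain_length": 15, "entropy": 2.0},
    {"domain_length": 10, "entropy": 2.0},
]
scaled = min_max_normalize(rows)
```

Scaling matters mostly for distance-based steps such as PCA and t-SNE, where a wide-range feature like a length would otherwise dominate ratio features that already lie between 0 and 1.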
Files
- enhanced_url_preprocessing.py: Main implementation with all preprocessing functions
- feature_importance.png: Visualization of feature importance (generated when the script is run)
- pca_visualization.png: PCA visualization of reduced dimensions (generated when the script is run)
Usage
To use the enhanced preprocessing:
import pandas as pd
from enhanced_url_preprocessing import process_url_data, normalize_features
# Load your data
data = pd.read_csv("your_data.csv")
# Extract enhanced features
features_df = process_url_data(data, url_column="URL")
# Normalize features (optional)
normalized_df = normalize_features(features_df)
# Save the extracted features (write normalized_df instead to keep the normalized values)
features_df.to_csv("enhanced_features.csv", index=False)
Requirements
The implementation requires the following Python packages:
- numpy
- pandas
- matplotlib
- scikit-learn
- tldextract
Install with:
pip install numpy pandas matplotlib scikit-learn tldextract
Feature Details
Domain-specific Features
- domain_length: Length of the domain name
- has_subdomain: Whether the URL has a subdomain
- subdomain_length: Length of the subdomain
- subdomain_count: Number of subdomains
- tld_length: Length of the TLD
- is_common_tld: Whether the TLD is common (com, org, net, etc.)
- is_country_tld: Whether the TLD is a country code
- is_suspicious_tld: Whether the TLD is commonly used in phishing
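A minimal sketch of how features like these could be derived with the standard library alone. It is not the repository's code: `domain_features` is a hypothetical function, both TLD sets are illustrative, and the last host label is naively treated as the TLD. That shortcut mishandles multi-label suffixes such as `co.uk` and IP-address hosts, which is the kind of case `tldextract` is meant to handle. The URL is assumed to include a scheme.

```python
from urllib.parse import urlsplit

# Illustrative sets, not taken from the repository.
COMMON_TLDS = {"com", "org", "net", "edu", "gov"}
SUSPICIOUS_TLDS = {"tk", "ml", "ga", "cf", "gq", "xyz", "top"}


def domain_features(url: str) -> dict:
    """Naive domain features: the last host label is assumed to be the TLD."""
    host = (urlsplit(url).hostname or "").lower()
    labels = [label for label in host.split(".") if label]
    tld = labels[-1] if len(labels) >= 2 else ""
    domain = labels[-2] if len(labels) >= 2 else host
    subdomain_labels = labels[:-2]
    subdomain = ".".join(subdomain_labels)
    return {
        "domain_length": len(domain),
        "has_subdomain": int(bool(subdomain_labels)),
        "subdomain_length": len(subdomain),
        "subdomain_count": len(subdomain_labels),
        "tld_length": len(tld),
        "is_common_tld": int(tld in COMMON_TLDS),
        # Two-letter alphabetic TLDs are country codes.
        "is_country_tld": int(len(tld) == 2 and tld.isalpha()),
        "is_suspicious_tld": int(tld in SUSPICIOUS_TLDS),
    }


features = domain_features("http://login.secure.example.tk/account")
```

Note that a TLD can be flagged as both a country code and suspicious, as `tk` is here.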
URL Structure Analysis
- path_length: Length of the URL path
- path_depth: Depth of the URL path (number of directories)
- has_query: Whether the URL has query parameters
- query_length: Length of the query string
- query_param_count: Number of query parameters
- has_fragment: Whether the URL has a fragment
- fragment_length: Length of the fragment
- is_ip_address: Whether the domain is an IP address
- is_shortened: Whether the URL is shortened
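The structural features map closely onto the pieces returned by `urllib.parse.urlsplit`. The sketch below shows one possible reading, under stated assumptions: `url_structure_features` and `is_ip_host` are hypothetical names, path depth is taken as the number of non-empty path segments, and the shortener host list is illustrative rather than the repository's.

```python
import ipaddress
from urllib.parse import parse_qsl, urlsplit

# Illustrative shortener hosts, not taken from the repository.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl", "is.gd"}


def is_ip_host(host: str) -> bool:
    """Return True if the host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_structure_features(url: str) -> dict:
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: depth counts non-empty path segments.
    segments = [seg for seg in parts.path.split("/") if seg]
    # keep_blank_values so that "next=" still counts as a parameter.
    params = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "path_length": len(parts.path),
        "path_depth": len(segments),
        "has_query": int(bool(parts.query)),
        "query_length": len(parts.query),
        "query_param_count": len(params),
        "has_fragment": int(bool(parts.fragment)),
        "fragment_length": len(parts.fragment),
        "is_ip_address": int(is_ip_host(host)),
        "is_shortened": int(host in SHORTENER_HOSTS),
    }


structure = url_structure_features(
    "http://192.168.0.1/a/b/login.php?user=1&next=#top"
)
```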
Advanced Text Analysis
- entropy: Shannon entropy of character distribution
- letter_ratio: Ratio of letters to total characters
- digit_ratio: Ratio of digits to total characters
- special_char_ratio: Ratio of special characters to total characters
- contains_brand: Whether the URL contains a popular brand name
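For reference, Shannon entropy over characters is H = -Σ p(c) · log2 p(c), where p(c) is the relative frequency of character c in the string. The sketch below computes it together with the character ratios. The function names and the brand list are assumptions for illustration, and "special" is taken here to mean any character that is neither a letter nor a digit.

```python
import math
from collections import Counter

# Illustrative brand list, not taken from the repository.
BRANDS = ("paypal", "apple", "amazon", "microsoft", "google")


def shannon_entropy(text: str) -> float:
    """Entropy in bits per character of the character distribution."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def text_features(url: str) -> dict:
    total = len(url)
    letters = sum(ch.isalpha() for ch in url)
    digits = sum(ch.isdigit() for ch in url)
    special = total - letters - digits
    lowered = url.lower()
    return {
        "entropy": shannon_entropy(url),
        "letter_ratio": letters / total if total else 0.0,
        "digit_ratio": digits / total if total else 0.0,
        "special_char_ratio": special / total if total else 0.0,
        "contains_brand": int(any(brand in lowered for brand in BRANDS)),
    }
```

A string of one repeated character has entropy 0, while a string whose four characters are all distinct has entropy 2 bits. Randomly generated hostnames tend to score higher than dictionary-like ones, which is why entropy is a useful signal.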
Advanced Pattern Detection
- excessive_subdomains: Whether the URL has excessive subdomains
- has_numeric_subdomain: Whether the subdomain contains numbers
- has_hyphen_in_domain: Whether the domain contains hyphens
- hyphen_count: Number of hyphens in the domain
- has_repeated_chars: Whether the domain has repeated characters
- has_homoglyphs: Whether the domain contains homoglyphs (similar-looking characters)
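The pattern flags could be sketched as follows. The thresholds are assumptions, since this README does not state them: more than three subdomain labels counts as excessive, and a run of three or more identical characters counts as repeated. The homoglyph check is a deliberately coarse heuristic (any non-ASCII character, or a punycode `xn--` prefix) rather than a real confusables lookup, and `pattern_features` is a hypothetical name.

```python
import re


def pattern_features(subdomain: str, domain: str, max_subdomains: int = 3) -> dict:
    """Heuristic pattern flags for an already-split subdomain and domain."""
    sub_labels = [label for label in subdomain.split(".") if label]
    return {
        # Assumed threshold: more than max_subdomains labels is "excessive".
        "excessive_subdomains": int(len(sub_labels) > max_subdomains),
        "has_numeric_subdomain": int(any(ch.isdigit() for ch in subdomain)),
        "has_hyphen_in_domain": int("-" in domain),
        "hyphen_count": domain.count("-"),
        # Assumed definition: the same character three or more times in a row.
        "has_repeated_chars": int(bool(re.search(r"(.)\1{2,}", domain))),
        # Coarse heuristic: punycode prefix or any non-ASCII character.
        "has_homoglyphs": int(domain.startswith("xn--") or not domain.isascii()),
    }


flags = pattern_features("a.b.c.d1", "secure-paypal-loginnn")
# Cyrillic "а" (U+0430) standing in for Latin "a".
spoofed = pattern_features("", "p\u0430ypal")
```

This heuristic also flags legitimate internationalized domain names, so it is better treated as a weak signal than as proof of an attack.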