Who's watching you. Dark Web
11-30-2007, 02:17 AM
Dark Web Terrorism Research
The AI Lab Dark Web project is a long-term scientific research program that aims to study and understand the international terrorism (Jihadist) phenomena via a computational, data-centric approach. We aim to collect "ALL" web content generated by international terrorist groups, including web sites, forums, chat rooms, blogs, social networking sites, videos, virtual world, etc.
We have developed various multilingual data mining, text mining, and web mining techniques to perform link analysis, content analysis, web metrics (technical sophistication) analysis, sentiment analysis, authorship analysis, and video analysis in our research.
The approaches and methods developed in this project contribute to advancing the field of Intelligence and Security Informatics (ISI). Such advances will help related stakeholders to perform terrorism research and facilitate international security and peace.
It is our belief that we (US and allies) are facing the dire danger of losing the "War on Terror" in cyberspace (especially when many young people are being recruited, incited, infected, and radicalized on the web) and we would like to help in our small (computational) way.
We thank the following agencies for providing research funding support.
National Science Foundation (NSF) September 2003 - August 2010
* (CRI: CRD) Developing a Dark Web Collection and Infrastructure for Computational and Social Sciences (NSF # CNS-0709338)
* (EXP-LA) Explosives and IEDs in the Dark Web: Discovery, Categorization, and Analysis (NSF # CBET-0730908)
* (SGER) Multilingual Online Stylometric Authorship Identification: An Exploratory Study (NSF # IIS-0646942)
* (ITR, Digital Government) COPLINK Center for Intelligence and Security Informatics Research (partial support) (NSF # EIA-0326348)
Library of Congress July 2005 - June 2008
* Capture of Multimedia, Multilingual Open Source Web-based At-Risk Content
DHS / CNRI October 2003 - September 2005
* BorderSafe Initiative (partial support)
We thank the following academic partners and colleagues for their support, help, and comments. Many of our terrorism research colleagues have taught us much about the significance and intricacy of this important domain. They also help guide us in the development of our scientific, computational approach.
Officers and domain experts of Tucson Police Department, Arizona Department of Customs and Border Protection, and San Diego Automated Regional Justice Information System (ARJIS) Program
Dr. Marc Sageman, University of Pennsylvania
Dr. Edna Reid, Clarion University
Dr. Joshua Sinai, The Analysis Corporation
Dr. Shlomo Argamon, Illinois Institute of Technology
Chip Ellis, Memorial Institute for the Prevention of Terrorism (MIPT)
Rex Hudson, Library of Congress
Dr. Chris Yang, Drexel University
Dr. Gabriel Weimann, University of Haifa, Israel
Dr. Mark Last, Ben-Gurion University, Israel
Drs. Henrik Larsen and Nasrullah Memon, Aalborg University, Denmark
Dr. Katrina von Knop, George Marshall Center, Germany
Dr. Jau-Hwang Wang and Robert Chang, Central Police University, Taiwan
Dr. Ee-Peng Lim, Nanyang Technological University, Singapore
Dr. Feiyue Wang, Chinese Academy of Sciences, China
Dr. Michael Chau, Hong Kong University
There has been significant interest from various intelligence, justice, and defense agencies in our computational methodologies, tools, and systems. However, we do not perform (security) clearance-level work nor do we conduct targeted cyber space crime or intelligence investigations. Our research staff members are primarily computer and information scientists from all over the world, and have expertise in more than 10 languages. We perform academic research, write papers (see below), and develop computer programs. We sincerely hope that our work can contribute to international security and peace.
Approach & Methodology
Claims: Dr. Gabriel Weimann of the University of Haifa has estimated that there are about 5,000 terrorist web sites as of 2006. Based on our actual spidering experience over the past 5 years, we believe there are about 50,000 sites of extremist and terrorist content as of 2007, including: web sites, forums, blogs, social networking sites, video sites, and virtual world sites (e.g., Second Life). The largest increase in 2006-2007 is in various new Web 2.0 sites (forums, videos, blogs, virtual world, etc.) in different languages (i.e., for home-grown groups, particularly in Europe). We have found significant terrorism content in more than 15 languages.
Testbed: We collect (using computer programs) various web contents every 2 to 3 months; we started spidering in 2002. Currently we only collect the complete contents of about 1,000 sites, in Arabic, Spanish, and English. We also have partial contents of about another 10,000 sites. In total, our collection is about 2 TB in size, with close to 500,000,000 pages/files/postings from more than 10,000 sites.
We believe our Dark Web collection is the largest open-source extremist and terrorist collection in the academic world. (We have no way of knowing what the intelligence, justice, and defense agencies are doing.) Researchers can have graded access to our collection by contacting our research center.
Web sites:
Our web site collection consists of the complete contents of about 1,000 sites, in various static (HTML, PDF, Word) and dynamic (PHP, JSP, CGI) formats. We collect every single page, link, and attachment within these sites. We also collect partial information from about 10,000 related (linked) sites. Some large well-known sites contain more than 10,000 pages/files in 10+ languages (in selected pages).
Forums:
We collect the complete contents (authors, headings, postings, threads, time-tags, etc.) of about 300 terrorist forums. We also perform periodic updates. Some large radical sites include more than 30,000 members with close to 1,000,000 messages posted.
Blogs, social networking sites, and virtual worlds:
We have identified and extracted many smaller, transient blogs and social networking sites (sites that appear and disappear very quickly), mostly hosted by terrorist sympathizers and wannabes. We have also identified more than 30 (self-proclaimed) terrorist or extremist groups in virtual world sites. (However, we are still unsure whether they are real terrorists/extremists or just playing roles in virtual games.)
Videos and multimedia content:
Terrorist sites are extremely rich in content, with heavy usage of multimedia formats. We have identified and extracted about 1,000,000 images and 15,000 videos from many terrorist sites and specialty multimedia file-hosting third-party servers. More than 50% of our videos are IED (Improvised Explosive Devices) related.
Computational Techniques: (Data Mining, Text Mining, and Web Mining)
Our computational tools are grouped in two categories: I. Collection; and II. Analysis and Visualization.
I. Collection:
Web site spidering:
We have developed various focused spiders/crawlers based on our previous digital library research. Our spiders can access password-protected sites and perform randomized (human-like) fetching. Our spiders are trained to fetch all HTML, PDF, and Word files, links, PHP, CGI, and ASP files, images, audio, and videos in a web site. To ensure freshness, we spider selected web sites every 2 to 3 months.
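The 2-to-3-month freshness policy can be illustrated with a minimal scheduling sketch. This is not the project's code: the function name, the site identifiers, and the 75-day interval (a midpoint of "2 to 3 months") are all assumptions made for the example.

```python
from datetime import date, timedelta

# Assumed revisit interval: the text says "every 2 to 3 months";
# 75 days is an illustrative midpoint, not a value from the project.
REVISIT_INTERVAL = timedelta(days=75)


def sites_due_for_recrawl(last_crawled, today, interval=REVISIT_INTERVAL):
    """Return the sites whose last crawl is at least `interval` old, oldest first.

    `last_crawled` maps a (hypothetical) site identifier to its last crawl date.
    """
    due = [site for site, seen in last_crawled.items() if today - seen >= interval]
    return sorted(due, key=lambda site: last_crawled[site])


if __name__ == "__main__":
    crawl_log = {
        "site-a.example": date(2007, 9, 1),
        "site-b.example": date(2007, 6, 15),
        "site-c.example": date(2007, 11, 1),
    }
    print(sites_due_for_recrawl(crawl_log, today=date(2007, 11, 30)))
```

Sorting the stalest sites first is one possible design choice; a real scheduler might also weigh how quickly each site changes.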
Forum spidering:
Our forum spidering tool recognizes 15+ forum hosting software packages and their formats. We collect the complete forum, including authors, headings, postings, threads, time-tags, etc., which allows us to reconstruct participant interactions. We perform periodic forum spidering and incremental updates based on research needs. We have collected and processed forum contents in Arabic, English, Spanish, French, and Chinese using selected computational linguistics techniques.
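As a rough illustration of how collected authors, threads, and time-tags could be turned into participant interactions, here is a minimal sketch. The `Post` record and the reply rule (each post answers the one immediately before it in the same thread) are assumptions for the example; the text does not say how the project infers who is responding to whom.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    thread_id: str
    author: str
    timestamp: int  # any sortable time-tag works, e.g. seconds since epoch


def reply_edges(posts):
    """Count (replier, replied_to) pairs across threads.

    Assumption for illustration: each post responds to the post immediately
    before it in the same thread. Self-replies are ignored.
    """
    by_thread = {}
    for post in posts:
        by_thread.setdefault(post.thread_id, []).append(post)

    edges = Counter()
    for thread_posts in by_thread.values():
        ordered = sorted(thread_posts, key=lambda p: p.timestamp)
        for previous, current in zip(ordered, ordered[1:]):
            if current.author != previous.author:
                edges[(current.author, previous.author)] += 1
    return edges
```

The resulting edge counts form a weighted, directed interaction network that could feed a network analysis step.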
Multimedia (image, audio, & video) spidering:
We have developed specialized techniques for spidering and collecting multimedia files and attachments from web sites and forums. We plan to perform steganography research to identify encrypted images in our collection and multimedia analysis (video segmentation, image recognition, voice/speech recognition) to identify unique terrorist-generated video contents and styles.
II. Analysis and Visualization:
Social network analysis (SNA):
We have developed various SNA techniques to examine web site and forum posting relationships. We have used various topological metrics (betweenness, degree, etc.) and properties (preferential attachment, growth, etc.) to model terrorist and terrorist site interactions. We have developed several clustering (e.g., Blockmodeling) and projection (e.g., Multi-Dimensional Scaling, Spring Embedder) techniques to visualize their relationships. Our focus is on understanding Dark Networks (unlike traditional bright scholarship, email, or computer networks) and their unique properties (e.g., hiding, justice intervention, rival competition, etc.).
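The two topological metrics named above can be computed on a small link graph as follows. This is a generic textbook sketch (degree centrality plus Brandes' algorithm for betweenness), not the project's implementation; the graph format, a dict mapping each node to the set of its neighbors in an undirected, unweighted network, is an assumption.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly linked to."""
    n = len(graph)
    if n <= 1:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def betweenness_centrality(graph):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes)."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        predecessors = {node: [] for node in graph}
        paths = {node: 0 for node in graph}
        distance = {node: -1 for node in graph}
        paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbor in graph[node]:
                if distance[neighbor] < 0:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    paths[neighbor] += paths[node]
                    predecessors[neighbor].append(node)
        # Accumulate pair dependencies in reverse BFS order.
        dependency = {node: 0.0 for node in graph}
        for node in reversed(order):
            for pred in predecessors[node]:
                dependency[pred] += paths[pred] / paths[node] * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Each undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}
```

A node with high betweenness sits on many shortest paths between others, which is why such metrics are commonly used to spot brokers or hubs in a network.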
Content analysis:
We have developed several detailed (terrorism-specific) coding schemes to analyze the contents of terrorist and extremist web sites. Content categories include: recruiting, training, sharing ideology, communication, propaganda, etc. We have also developed computer programs to help automatically identify selected content categories (e.g., web master information, forum availability, etc.).
Web metrics analysis:
Web metrics analysis examines the technical sophistication, media richness, and web interactivity of extremist and terrorist web sites. We examine technical features and capabilities (e.g., their ability to use forms, tables, CGI programs, multimedia files, etc.) of such sites to determine their level of web savvy. Web metrics provide a measure of terrorists'/extremists' capability and resources. All terrorist site web metrics are extracted and computed using computer programs.
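Counting technical features on a page can be sketched with Python's standard `html.parser` module. The feature groups below are illustrative assumptions; the project's actual feature set and scoring are not given in the text.

```python
from collections import Counter
from html.parser import HTMLParser

# Illustrative tag groups only; not the project's feature list.
FEATURE_TAGS = {
    "forms": {"form"},
    "tables": {"table"},
    "multimedia": {"img", "audio", "video", "embed", "object"},
    "links": {"a"},
}


class FeatureCounter(HTMLParser):
    """Tally start tags that belong to each illustrative feature group."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        for feature, tags in FEATURE_TAGS.items():
            if tag in tags:
                self.counts[feature] += 1


def count_features(html_text):
    """Return a dict of feature group -> number of matching tags in the page."""
    parser = FeatureCounter()
    parser.feed(html_text)
    parser.close()
    return {feature: parser.counts[feature] for feature in FEATURE_TAGS}
```

Per-page counts like these could then be aggregated per site and compared across sites as a crude sophistication indicator.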
Sentiment and affect analysis:
Not all sites are equally radical or violent. Sentiment (polarity: positive/negative) and affect (emotion: violence, racism, anger, etc.) analysis allows us to identify radical and violent sites that warrant further study. We also examine how radical ideas become infectious based on their contents, their senders, and the senders' interactions. We rely heavily on recent advances in opinion mining, which analyzes opinions in short web-based texts. We have also developed selected visualization techniques to examine sentiment/affect changes over time and among people. Our research includes several probabilistic multilingual affect lexicons and selected dimension reduction and projection (e.g., Principal Component Analysis) techniques.
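A lexicon-based affect score can be sketched as below. The toy lexicon, its weights, and the normalization by token count are all made up for illustration; the project's probabilistic multilingual lexicons are not published in this text.

```python
import re

# Toy lexicon with invented weights: word -> {affect class: intensity in [0, 1]}.
TOY_LEXICON = {
    "attack": {"violence": 0.9},
    "destroy": {"violence": 0.8, "anger": 0.4},
    "hate": {"anger": 0.9},
    "furious": {"anger": 0.7},
}


def affect_scores(text, lexicon=TOY_LEXICON):
    """Sum lexicon intensities per affect class, normalized by token count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {}
    totals = {}
    for token in tokens:
        for affect, weight in lexicon.get(token, {}).items():
            totals[affect] = totals.get(affect, 0.0) + weight
    return {affect: total / len(tokens) for affect, total in totals.items()}
```

Normalizing by length keeps long postings from scoring higher simply because they contain more words; other normalizations are equally plausible.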
Authorship analysis and Writeprint:
Grounded in authorship analysis research, we have developed the (cyber) Writeprint technique to uniquely identify anonymous senders based on the signatures associated with their forum messages. We expand the lexical and syntactic features of traditional authorship analysis to include system (e.g., font size, color, web links) and semantic (e.g., violence, racism) features of relevance to online texts of extremists and terrorists. We have also developed advanced Ink Blot and Writeprint visualizations to help visually identify web signatures. Our Writeprint technique has been developed for Arabic, English, and Chinese. The Arabic Writeprint consists of more than 400 features, all automatically extracted from online messages using computer programs. Writeprint can achieve an accuracy level of 95%.
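To give a feel for what automatically extracted writing-style features look like, here is a minimal sketch that computes five simple features from one message. It is not the Writeprint technique, whose 400+ features are not listed in this text; the feature names and definitions are assumptions chosen for illustration.

```python
import re
import string


def style_features(message):
    """Compute a handful of simple lexical and structural features for one message."""
    words = re.findall(r"[A-Za-z']+", message)
    lowered = [w.lower() for w in words]
    n_chars = len(message)
    n_words = len(words)
    return {
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "vocabulary_richness": len(set(lowered)) / n_words if n_words else 0.0,
        "punctuation_ratio": (
            sum(ch in string.punctuation for ch in message) / n_chars if n_chars else 0.0
        ),
        "uppercase_ratio": (
            sum(ch.isupper() for ch in message) / n_chars if n_chars else 0.0
        ),
        # Crude stand-in for the "web links" system feature mentioned above.
        "link_count": len(re.findall(r"https?://", message)),
    }
```

Feature vectors like this, computed per message, are the usual input to a classifier or a visualization that compares authors.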
Video analysis:
A significant portion of our videos are IED related. Based on previous terrorism ontology research, we have developed a unique coding scheme to analyze terrorist-generated videos based on the contents, production characteristics, and metadata associated with the videos. We have also developed a semi-automated tool to allow human analysts to quickly and accurately analyze and code these videos.
IEDs in Dark Web analysis:
We have conducted several systematic studies to identify IED related content generated by terrorist and insurgency groups in the Dark Web. A small number of sites are responsible for distributing a large percentage of IED related web pages, forum postings, training materials, explosive videos, etc. We have developed unique signatures for those IED sites based on their contents, linkages, and multimedia file characteristics. Much of the content needs to be analyzed by military analysts. Training materials also need to be developed for troops before their deployment (seeing the battlefield through your enemy's eyes).
Team Members (selected)
Dr. Hsinchun Chen firstname.lastname@example.org
Cathy Larson email@example.com
Ahmed Abbasi firstname.lastname@example.org
Tianju Fu email@example.com
David Zimbra firstname.lastname@example.org
Sven Thoms email@example.com
Yi-Da Chen firstname.lastname@example.org
Ben Zheren Hu email@example.com
Hsinmin Lu firstname.lastname@example.org
Alumni Team Members
Enrique Arevalo
Alfonso A. Bonillas
Dr. Wingyan Chung
Carrie Fang
Dr. Guanpi Lai (Greg)
Dr. Dan McDonald
Dr. Jialun Qin
Dr. Jennifer Jie Xu
Dr. Rob Schumaker
Danning Hu
Dr. Yilu Zhou
Arab Salim
Dr. Edna Reid
Lu Tseng
Kira Joslin
Press and Publications
Press and Media:
Dark Web research has been featured in many national, international and local press and media, including: National Science Foundation press, Associated Press, BBC, Fox News, National Public Radio, Science News, Discover Magazine, Information Outlook, Wired Magazine, The Bulletin (Australian), Australian Broadcasting Corporation, Arizona Daily Star, East Valley Tribune, Phoenix ABC Channel 15, and Tucson Channels 4, 6, and 9. See our Recognitions page for links to these and other stories. Our research has been recognized for its contribution to national security.
As an NSF-funded research project, our research team has generated significant findings and publications in major computer science and information systems journals and conferences. However, we have taken great care not to reveal sensitive group information or technical implementation details (specifics). We hope our research will help educate the next generation of cyber/Internet savvy analysts and agents in the intelligence, justice, and defense communities.
A Few Words about Civil Liberties and Human Rights: The Dark Web project is NOT like Total Information Awareness (TIA) (at least we try very hard not to be like it). This is not a secretive government project conducted by spooks. We perform scientific, longitudinal, hypothesis-guided terrorism research like other terrorism researchers (who have done such research for 30+ years). However, we are clearly more computationally oriented than traditional terrorism research, which relies on sociology, communications, and policy-based methodologies. Our contents are open source in nature (similar to Google's contents) and our major research targets are international Jihadist groups, not regular citizens. Our researchers are primarily computer and information scientists from all over the world. We develop computer algorithms, tools, and systems. Our research goal is to study and understand the international extremism and terrorism phenomena. Some people may refer to this as understanding the root cause of terrorism.
The following books and papers can be found easily from various academic sources:
I. Books (Monograph, Edited Volume, and Proceedings): Intelligence and Security Informatics (ISI) related; Dark Web research included.
H. Chen and C. Yang (Eds.), Intelligence and Security Informatics, Springer, forthcoming, 2008.
H. Chen, E. Reid, J. Sinai, A. Silke, and B. Ganor (Eds.), Terrorism Informatics: Knowledge Management and Data Mining for Homeland Security, Springer, forthcoming, 2008.
H. Chen, T. S. Raghu, R. Ramesh, A. Vinze, and D. Zeng (Eds.), Handbooks in Information Systems -- National Security, Elsevier Scientific, 2007.
C. Yang, D. Zeng, M. Chau, K. Chang, Q. Yang, X. Cheng, J. Wang, F. Wang, and H. Chen (Eds.), Intelligence and Security Informatics, Proceedings of the Pacific-Asia Workshop, PAISI 2007, Lecture Notes in Computer Science (LNCS 4430), Springer-Verlag, 2007.
S. Mehrotra, D. Zeng, H. Chen, B. Thuraisingham, and F. Wang (Eds.), Intelligence and Security Informatics, Proceedings of the IEEE International Conference on Intelligence and Security Informatics, ISI 2006, Lecture Notes in Computer Science (LNCS 3975), Springer-Verlag, 2006.
H. Chen, F. Wang, C. Yang, D. Zeng, M. Chau, and K. Chang (Eds.), Intelligence and Security Informatics, Proceedings of the Workshop on Intelligence and Security Informatics, WISI 2006, Lecture Notes in Computer Science (LNCS 3917), Springer-Verlag, 2006.
H. Chen, Intelligence and Security Informatics for International Security: Information Sharing and Data Mining, Springer, 2006.
P. Kantor, G. Muresan, F. Roberts, D. Zeng, F. Wang, H. Chen, and R. Merkle (Eds.), Intelligence and Security Informatics, Proceedings of the IEEE International Conference on Intelligence and Security Informatics, ISI 2005, Lecture Notes in Computer Science (LNCS 3495), Springer-Verlag, 2005.
H. Chen, R. Moore, D. Zeng, and J. Leavitt (Eds.), Intelligence and Security Informatics, Proceedings of the Second Symposium on Intelligence and Security Informatics, ISI 2004, Lecture Notes in Computer Science (LNCS 3073), Springer-Verlag, 2004.
H. Chen, R. Miranda, D. Zeng, T. Madhusudan, C. Demchak, and J. Schroeder (Eds.), Intelligence and Security Informatics, Proceedings of the First NSF/NIJ Symposium on Intelligence and Security Informatics, ISI 2003, Lecture Notes in Computer Science (LNCS 2665), Springer-Verlag, 2003.
II. Journal Articles (published and forthcoming):
Abbasi, A., Chen, H., and Salem, A. "Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums." ACM Transactions on Information Systems, forthcoming, 2008.
Reid, E. and H. Chen, Contemporary Terrorism Researchers Patterns of Collaboration and Influence, Journal of the American Society for Information Science and Technology, forthcoming, 2008.
Schumaker, R. and H. Chen, Leveraging Question Answer Technology to Address Terrorism Inquiry, Decision Support Systems, forthcoming, 2008.
Reid, E. and Chen, H., "Mapping the Contemporary Terrorism Research Domain." International Journal of Human-Computer Studies, 65, Pages 42-56, 2007.
Qin, J., Zhou, Y., Reid, E., Lai, G., Chen, H., "Analyzing Terror Campaigns on the Internet: Technical Sophistication, Content Richness, and Web Interactivity," International Journal of Human-Computer Studies, 65, Pages 71-84, 2007.
Reid, E. and Chen, H. "Internet-Savvy U.S. and Middle Eastern Extremist Groups." Mobilization: An International Quarterly, 12(2), pp. 177-192, 2007.
Li, J., R. Zheng, and H. Chen, From Fingerprint to Writeprint, Communications of the ACM, Volume 49, Number 4, Pages 76-82, April 2006.
Zheng, R., J. Li, H. Chen, and Z. Huang, A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques, Journal of the American Society for Information Science and Technology, Volume 57, Number 3, Pages 378-393, 2006.
H. Chen and F. Wang, "Artificial Intelligence for Homeland Security", IEEE Intelligent Systems, Special Issue on Artificial Intelligence for National and Homeland Security, pp. 12-16, September/October 2005.
A. Abbasi and H. Chen, "Applying Authorship Analysis to Extremist-Group Web Forum Messages", IEEE Intelligent Systems, Special Issue on Artificial Intelligence for National and Homeland Security, pp. 67-75, September/October 2005.
Zhou, Y., Reid, E., Qin, J., Lai, G., Chen, H., U.S. Domestic Extremist Groups on the Web: Link and Content Analysis, IEEE Intelligent Systems, Special Issue on Artificial Intelligence for National and Homeland Security, pp. 44-51, September/October 2005.
III. Conference papers:
T. Fu., A. Abbasi, and H. Chen. "Interaction Coherence Analysis for Dark Web Forums," in Proceedings of the 2007 IEEE Intelligence and Security Informatics Conference, New Brunswick, NJ, May 23-24, 2007, p. 342-349.
A. Abbasi and H. Chen. "Categorization and Analysis of Text in Computer Mediated Communication Archives Using Visualization," in Proceedings of the 2007 Joint Conference on Digital Libraries (JCDL), Vancouver, BC, Canada, June 18-23, 2007, p. 11-18.
A. Abbasi and H. Chen, "Visualizing Authorship for Identification," In Proceedings of the Intelligence and Security Informatics: IEEE International Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA, USA, May 23-24, 2006.
J. Wang, T. Fu, H. Lin, and H. Chen, "A Framework for Exploring Gray Web Forums: Analysis of Forum-Based Communities in Taiwan," In Proceedings of the Intelligence and Security Informatics: IEEE International Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA, USA, May 23-24, 2006.
Y. Zhou, J. Qin, G. Lai, E. Reid, and H. Chen, "Exploring the Dark Side of the Web: Collection and Analysis of U.S. Extremist Online Forums," In Proceedings of the Intelligence and Security Informatics: IEEE International Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA, USA, May 23-24, 2006.
A. Salem, E. Reid, and H. Chen, "Content Analysis of Jihadi Extremist Groups' Videos," In Proceedings of the Intelligence and Security Informatics: IEEE International Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA, USA, May 23-24, 2006.
J. Xu, H. Chen, Y. Zhou, and J. Qin, "On the Topology of the Dark Web of Terrorist Groups," In Proceedings of the Intelligence and Security Informatics: IEEE International Conference on Intelligence and Security Informatics (ISI 2006), San Diego, CA, USA, May 23-24, 2006.
Zhou, Y., Qin, J., Lai, G., Reid E. and Chen, H., "Building Knowledge Management System for Researching Terrorist Groups on the Web," Proceedings of the AIS Americas Conference on Information Systems (AMCIS 2005) , Omaha, NE, USA, August 11-14, 2005.
Mapping the Contemporary Terrorism Research Domain: Researchers, Publications, and Institutions Analysis, ISI Conference 2005, Atlanta, GA, May, 2005.
Abbasi, A & Chen, H. 2005. "Applying Authorship Analysis to Arabic Web Content." ISI Conference 2005, Atlanta, GA, May, 2005.
Reid, E., Qin, J., Zhou, Y., Lai, G., Sageman, M., Weimann, G., and Chen, H., "Collecting and Analyzing the Presence of Terrorists on the Web: A Case Study of Jihad Websites," IEEE International Conference on Intelligence and Security (ISI 2005), Atlanta, Georgia, 2005.
Chen, H., Qin, J., Reid, E., Chung, W., Zhou, Y., Xi, W., Lai, G., Bonillas, A. and Sageman, M., "The Dark Web Portal: Collecting and Analyzing the Presence of Domestic and International Terrorist Groups on the Web," Proceedings of the 7th International Conference on Intelligent Transportation Systems (ITSC), Washington D.C., October 3-6, 2004.
E. Reid, J. Qin, W. Chung, J. Xu, Y. Zhou, R. Schumaker, M. Sageman, H. Chen, "Terrorism Knowledge Discovery Project: A Knowledge Discovery Approach to Addressing the Threats of Terrorism," Proceedings of the Second Symposium on Intelligence and Security Informatics, June 10-11, 2004, Tucson, AZ, 2004, pp. 125-145.
H. Chen, "The Terrorism Knowledge Portal: Advanced Methodologies for Collecting and Analyzing Information from the Dark Web and Terrorism Research Resources," presented at the Sandia National Laboratories, August 14, 2003.
IV. Presentations in Seminars or Conferences (PowerPoint) Password protected; please send request via email and provide a brief explanation of your interest.
Affect and Sentiment Analysis of Web Forums, July, 2007.
Large-scale Forum Analysis of Selected Radical Sites, March, 2007.
Explosives and IEDs in the Dark Web: Discovery, Categorization, and Analysis, February, 2007.
ClearGuidance.com Analysis, September, 2006.
Writeprints and Ink Blots: Visualizing Authorship for Identification and Authentication, Tucson, August, 2005.
Data Mining & Webometric Analysis of Terrorist/Extremist Groups Digital Artifacts, Singapore, August 2005.
Applying Authorship Identification to Web Forums: Analysis of English and Arabic Extremist Group Postings, Tucson, April, 2005.
Content and Link Analysis of Domestic and International Terrorism Websites, Tucson, AZ, March 23, 2005.
Advanced Methodology for Collecting and Analyzing Information from the "Dark Web", Tucson, AZ, Feb 10, 2005.
Multilingual Authorship Analysis for Web Content: A Comparison of English and Arabic Language Models, Tucson, December, 2004.
For access to our testbed, recent presentations and publications, and selected demos, please contact our research center:
Ms. Cathy Larson,