EBIB    05.02 / Bulletin full texts - EBIB No.1/2001

 

Duch, Wodzislaw: Neural networks papers on the Internet
Nicholas Copernicus University, Torun, Poland

It is no secret that the new generation of students and young researchers tries to avoid going to the library by searching for everything on the Internet. For them information that cannot be found on the Internet does not exist. Last year I got an email from a high-school student in the USA asking "who was this Copernicus guy? I have to write a paper and can't find anything ...".

For the new generation the Internet is becoming not just a source of information, but the only source of information. Young researchers first check the Internet pages of journals to find relevant papers and only then - if they are lucky and have a good library subscribing to the particular journal - go to the library. Unfortunately good libraries are hard to find even in the USA and Western Europe, not to mention Central European and developing countries. Quite a few people (myself included) moved from engineering, physics or mathematics to neural computing. The fact that they could collect many useful papers by downloading them from Internet archives greatly facilitated this movement. Yet many "serious" scientists pay little attention to this new medium, spending a significant amount of time on preparing their articles and conference presentations and devoting virtually no time to maintaining archives of their papers and web pages describing their projects. There are various prizes for the best papers but no prizes for the most useful WWW sites, hence there is little motivation to spend any time on developing and maintaining such sites.

Some conferences publish thousands of pages in thick volumes - who can read it all? It is faster to find the relevant papers on the Internet and print only those that one wants. Neural network societies (INNS, ENNS, APNNA, IEEE NNC), journals and large conferences continue their business as usual: the WWW addresses of participants are almost never published, frequently even email addresses are omitted, useful WWW sites are never mentioned, and conference home pages rarely keep a table of contents of their proceedings. There are some signs of change, at least with respect to IEEE conferences: a great new service called IEEE Xplore (mentioned in the editorial of the IEEE TNN in January 2000) allows everyone to access tables of contents of IEEE Transactions, journals, magazines, conference proceedings and standards. IEEE members also have full access to articles and other material in PDF format. These publications should be cross-linked using hyperlinks. This is a welcome extension to the IEEE bibliographies on-line and to the Opera services. IEEE Xplore may be found at http://ieeexplore.ieee.org.

Many experts maintain their own local archives, but without a well-organized central repository of catalogued papers these archives are hard to find, and authors have little motivation to place their papers in them. Without the support of the neural network community many useful initiatives to collect papers run out of steam and die. For example, the Neuroprose archive, which started in 1989 and accumulated more than 600 papers and 53 PhD theses, has never been properly organized. The papers were automatically dropped into one huge directory that finally grew to a size that prevented finding anything interesting unless one knew the name of the file. The readme file was last modified in 1994 and the archive itself seems to have died in 1998. Neuroprose was based on the ftp protocol, which is out of date and frequently does not work properly because of firewalls set up for security reasons. Uploading files to one big directory is certainly not a good long-term solution. There is no reason why archives should not be accessible by the http protocol through Web pages.

The London and South-East center for High Performance Computing (SEL-HPC) opened an archive storing papers related to parallel and functional programming, vision and image processing, computational mathematics, neural networks, human computer interaction and other subjects. The archive accumulated over 7000 papers and also stored links to the home pages of the people who kept their papers there. Unfortunately the SEL-HPC center was shut down in 1998, and although the archive is still there it does not accept new papers, giving the "forbidden access" message instead. There was no sign on the archive pages that this is the case, but this should have been changed by now.

The Internet is full of old files that have not been removed and that show up every time you search. Ask for "neural archives" and you will get either a missing link or a 1993 list of archives that have not existed for years. Only a few people are responsible enough to remove outdated files and to state the date of the last modification on their pages. Below I give a short review of the current situation - in-depth reviews of some of these and of other useful projects will be welcomed.

The Los Alamos e-Print Archive (http://arxiv.org/) has continuously served the physics and mathematics community since August 1991. The number of hosts connecting to this archive reaches 9000 per day (as of April 2000), excluding numerous mirror sites, and the number of new papers stored there exceeds 2500 per month. The current estimate of the total number of papers in this archive is about 130,000. It contains several sub-archives of interest to the neural network community:

  • The Computing Research Repository (CoRR), opened in September 1998, is sponsored by ACM. Papers in CoRR are classified in two ways: by subject area (each subject has a moderator) and by the ACM classification scheme, which covers all of computer science. Subjects relevant to computational intelligence include: artificial intelligence, computation and language, computer vision and pattern recognition, human-computer interaction, learning, multiagent systems, neural and evolutionary computation, and robotics. The address is http://arxiv.org/archive/cs/intro.html
  • The Nonlinear Sciences repository includes adaptation and self-organizing systems, cellular automata and chaotic dynamics.
  • The Physics repository includes disordered systems and neural networks, data analysis, statistics, probability and Bayesian analysis.
  • Other interesting repositories in this archive, such as cellular/molecular neuroscience and developmental and behavioral/systems neuroscience, are not too popular yet, but are ready to receive more papers.

The e-Print Archive has many copies around the world and is a great service to the physics and mathematics community. A similar initiative has been started in the cognitive sciences: the CogPrints archive (http://cogprints.soton.ac.uk/) will store papers in psychology, neuroscience, linguistics, artificial intelligence, robotics, vision, learning, speech, neural networks, philosophy of mind and language, behavioral ecology, sociobiology, behavior genetics, evolutionary theory, psychiatry, neurology, human genetics, brain imaging, anthropology and other social and mathematical sciences pertinent to the study of cognition. The CogPrints archive received the Psychological Science Award for "contribution to psychology on the Internet" from PsychologicalScience.net.

Although no comparable repository designed specifically for the neural network community seems to exist, several initiatives are worth mentioning. They aim at indexing Web resources by searching for papers placed in people's own archives. Storing links to papers at different sites has some disadvantages: links are sometimes changed or papers removed by system administrators, and it is hard to index them in a useful way. On the other hand, it is much easier to index the Web than to create a central repository.

The Princeton NEC Research Institute team (Lee Giles, Steve Lawrence and Kurt Bollacker) created CiteSeer, now called Research Index (http://www.researchindex.com), a system for automatically creating digital libraries, with emphasis on citation matching, indexing and ranking. Links to the postscript or PDF papers are shown as the result of a search, and cached versions of articles are available directly from the system; the "correct" option retrieves the first page of the paper and shows it in the browser. Since last and first names may not be correctly ordered, authors should check their papers manually and correct the database, especially if their names contain letters with accents - the system does not deal correctly with French, Spanish or Polish names! On demand, citations are shown in the context in which they appeared. Research Index seems to be quite popular, since the system reports high load even in the middle of the night (Princeton time). The "Computer Science Directory" shows a list of papers in different subject areas, ordered according to the number of citations, the authority of their authors and their tutorial value.

Research Index is certainly the best bibliographical system at the moment. It does not give references to authors' home pages (it would be rather hard to do this automatically, since authors do not give their web pages in publications), but one can frequently guess them starting from the link to the paper. The HP-search service (Trier University, Germany, http://hpsearch.uni-trier.de/hp/ ) is the best chance to find personal home pages of computer scientists. In April 2000 more than 42,000 entries were stored there.

The Collection of Computer Science Bibliographies has more than one million references, amounting to 660 MBytes of BibTeX entries! It is a meta-service, composed of about 1200 specialized bibliographies and updated monthly from their original locations. About 90,000 references contain URLs to online versions of papers. There are more than 2000 links to other sites carrying bibliographic information, including the large Computer Science Bibliography at Trier University, which stores bibliographical information from major conferences, books and journals but contains relatively little information related to neural networks. The address is http://liinwww.ira.uka.de/bibliography/index.html.

The ML Papers search engine (http://gubbio.cs.berkeley.edu/mlpapers/), created in 1997 to index machine learning papers, was probably the first system implementing automatic extraction of titles, authors and abstracts from postscript versions of papers. It has a simple search interface and shows titles, authors, abstracts and links to postscript papers (this seems to be the only format indexed). Almost 1300 papers with the "neural" keyword were found in April 2000. Adding new papers requires only giving the URL of an FTP or HTTP archive.

An interesting approach to the problem of automatic indexing has been taken by the Just Research company, creators of Cora (http://www.cora.justresearch.com/), a special-purpose search engine covering over 50,000 research papers found in about one hundred computer science departments. Computer science has been divided into 10 categories; the "Artificial Intelligence" category includes data mining and machine learning, and in the "Machine Learning" section neural networks are found among case-based learning, genetic algorithms, probabilistic methods, reinforcement learning, rule learning and theory. In total there are 75 computer science categories in this database. Cora allows searching for keywords found in papers that are stored in the postscript format. Results are used to provide automatically generated BibTeX entries (sometimes with errors) and include the title, authors, abstract of the paper, the address of the main page where the paper has been found, alternative addresses where it is stored, a list of references extracted from the paper and backward references (papers referring to the current paper).
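How a BibTeX entry might be assembled from automatically extracted fields can be illustrated with a minimal sketch. This is not Cora's code: the function name, the choice of the @misc entry type and the sample values are all hypothetical, and a real system would also have to escape special characters and guess missing fields, which is where the errors mentioned above tend to arise.

```python
def bibtex_entry(key, title, authors, year, url=None):
    """Format extracted metadata as a BibTeX @misc entry (illustrative only)."""
    fields = [
        ("author", " and ".join(authors)),  # BibTeX separates authors with "and"
        ("title", title),
        ("year", str(year)),
    ]
    if url:
        fields.append(("url", url))
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@misc{{{key},\n{body}\n}}"


# Hypothetical sample values, not taken from any real paper.
print(bibtex_entry("doe2000", "An Example Title", ["Jane Doe", "John Roe"], 2000))
```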

The process of adding new files is very simple, requiring just the submission of the URL of the archive. The 50 highest-ranking papers are automatically displayed in each category, with rank determined by an analysis of citations that allows automatic identification of survey articles, seminal articles, papers with the largest number of references and papers written by authoritative authors. Reinforcement learning and probabilistic techniques have been used for automatic classification. The Cora project is led by Andrew McCallum, and although it is still in the research phase it is quite useful.
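The simplest form of citation-based ranking can be sketched as follows. This is only an illustration under stated assumptions, not the method used by Cora or Research Index: it builds backward references by inverting the reference lists and ranks papers purely by the number of distinct papers citing them. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def backward_references(references):
    """Invert a map paper -> cited papers into cited paper -> citing papers."""
    cited_by = defaultdict(set)
    for paper, cited in references.items():
        for target in cited:
            if target != paper:  # ignore self-citations
                cited_by[target].add(paper)
    return cited_by


def rank_by_citations(references, top=50):
    """Return up to `top` papers, most cited first; ties are broken by name."""
    cited_by = backward_references(references)
    papers = set(references) | set(cited_by)
    return sorted(papers, key=lambda p: (-len(cited_by.get(p, ())), p))[:top]


# Hypothetical toy collection: each paper lists the papers it cites.
refs = {
    "A": ["C", "D"],
    "B": ["C"],
    "C": ["D"],
    "D": [],
    "E": ["C", "D", "A"],
}
print(rank_by_citations(refs, top=3))
```

A realistic ranking would also weight citations by the authority of the citing paper and first has to match differently formatted citations to the same paper, which is the hard part of the problem.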

The New Zealand Digital Library (NZDL) project (http://www.nzdl.org/) has created publicly available search engines for domains ranging from computer science technical reports to music videos. The emphasis of this project is on the creation of full-text searchable digital libraries; it does not use machine learning technology to automate the creation of search engines. The NZDL collection provides a huge index to over 45,000 computer science reports: whole papers were analyzed (over 30 Gbytes of postscript files, 1.3 million pages), so any fragment of text may be found. The collection also contains almost 30,000 figures extracted from the reports.
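The idea behind full-text search can be illustrated with a minimal inverted index. This sketch is not the NZDL software; the function names and sample documents are hypothetical. It maps every word to the set of documents containing it and answers a query with the documents that contain all query words. Finding an exact fragment of text, as described above, would additionally require storing word positions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical report titles standing in for full report texts.
docs = {
    "tr-1": "Neural networks for pattern recognition",
    "tr-2": "Functional programming and parallel computing",
    "tr-3": "Parallel training of neural networks",
}
index = build_index(docs)
print(sorted(search(index, "neural networks")))
```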

The Networked Computer Science Technical Reference Library (NCSTRL) aims to create a "premier international on-line library of computer science technical reports". It provides software allowing institutions (over 160 at the moment, including leading US universities) to create digital libraries which are then integrated into the NCSTRL services. A simpler solution is also possible, with FTP archive servers, requiring the contribution of bibliographical information to the central NCSTRL index (http://cs-tr.cs.cornell.edu/). Since the indexing is not automatic, the project will probably remain more popular with libraries (it is run by the Cornell University Digital Library Research Group) than with computer scientists.

Experimental methods of searching for relevant information based on self-organizing maps exist (for example WebSOM, http://websom.hut.fi/websom/), but so far they are not useful for large-scale searches.

Finally, the IEEE Bibliographies Online (for IEEE members only) provides bibliographical information on IEEE-sponsored conferences and journals, but no links to the actual papers (these are available to subscribers via the Opera services). The IEEE bibliographies are at http://www.biblio.ieee.org/scripts/biblio_home.html.

Links to all services mentioned here (and to many others) are stored at: http://www.phys.uni.torun.pl/~duch/neural.html#biblio.


Bulletin full texts - EBIB No.1/2001 [Electronic document] . - Access mode: http://ebib.oss.wroc.pl/english/a1.php
Page editor: Anna Filipowicz (ankaf@bn.org.pl) Biblioteka Narodowa
Last modification: 24.01.2001