Duch, Wodzislaw: Neural networks papers on the Internet
Nicholas Copernicus University, Torun, Poland
It's no secret that the new generation of students and young researchers
tries to avoid going to the library by searching for everything on the
Internet. For them information that cannot be found on the Internet does
not exist. Last year I got an email from a US high-school student
asking "who was this Copernicus guy? I have to write a paper and can't
find anything ...".
For the new generation the Internet is becoming not just a source of
information, but the only source of information. Young researchers first
check the Internet pages of journals to find relevant papers and only then
- if they are lucky and have a good library subscribing to the particular
journal - go to the library. Unfortunately good libraries are hard to find
even in the USA and Western Europe, not to mention the Central European
and developing countries. Quite a few people (myself included) moved from
engineering, physics or mathematics to neural computing. The fact that
they could collect many useful papers by downloading them from Internet
archives greatly facilitated this move. Yet many "serious" scientists
pay little attention to this new medium, spending a significant amount of
time on preparing their articles and conference presentations and
devoting virtually no time to maintaining archives of their papers and web
pages describing their projects. There are various prizes for the best
papers but no prizes for the most useful WWW sites, hence little
motivation to spend any time on developing and maintaining such sites.
Some conferences publish thousands of pages in thick volumes - who can
read it all? It is faster to find the relevant papers on the Internet and
print only those that one wants. Neural network societies (INNS, ENNS,
APNNA, IEEE NNC), journals and large conferences still continue their
business as usual: the WWW addresses of participants are almost never
published, frequently even email addresses are not published, no
information about useful WWW sites is ever mentioned, conference home
pages rarely keep a table of contents of their proceedings. There are some
signs of change, at least with respect to IEEE conferences: a great new
service called IEEE Xplore (mentioned in the editorial article of the
IEEE TNN in January 2000) allows everyone to access tables of contents of
IEEE Transactions, journals, magazines, conference proceedings and
standards. IEEE members also have full access to articles and other
material in the PDF format. These publications should be cross-linked
using hyperlinks. This is a welcome extension to the IEEE bibliographies
on-line and to the Opera services. IEEE Xplore may be found at
http://ieeexplore.ieee.org.
Although many experts maintain their own local archives, without a
well-organized central repository of catalogued papers these archives are
hard to find, and authors have little motivation to place their papers in
them. Without the support of the neural network community many useful
initiatives to collect papers run out of steam and die. For example, the
Neuroprose archive of papers, which started in 1989 and accumulated more
than 600 papers and 53 PhD theses, has never been properly organized. The
papers were automatically dropped into one huge directory that finally
grew to a size that prevented finding anything interesting unless one
knew the name of the file. The readme file was modified for the last time
in 1994 and the archive itself seems to have died in 1998. Neuroprose was
based on the FTP protocol, which is out of date and frequently does not
work properly due to various firewalls set up for security reasons.
Uploading the files to one big directory is certainly not a good
long-term solution. There is no reason why the archives should not be
accessible by
the HTTP protocol through Web pages. The London and South-East center for
High Performance Computing (SEL-HPC) opened an archive storing papers
related to parallel and functional programming, vision and image
processing, computational mathematics, neural networks, human computer
interaction and other subjects. The archive accumulated over 7000 papers
and also stored links to the home pages of the people who kept their
papers there. Unfortunately in 1998 the SEL-HPC center was shut down, and
although the archive is still there it does not accept new papers, giving
the "forbidden access" message. There was no sign on the archive pages
that this is the case, but this should have been changed by now.
The Internet is full of old files that have not been removed and that show
up every time you do a search. Ask for "neural archives" and you will
get either a missing link or a 1993 list of archives that have not existed
for years. Few people are responsible enough to remove outdated files and
to write the date of the last modification on their pages. Below I give a
short review of the current situation - in-depth reviews of some of these
and of other useful projects will be welcome.
The Los Alamos e-Print Archive (http://arxiv.org/) has continuously served
the physics and mathematics community since August 1991. The number of
hosts connecting to this archive reaches 9000 per day (as of April 2000),
excluding numerous mirror sites, and the number of new papers stored there
exceeds 2500 per month. The current estimate of the total number of papers
in this archive is about 130,000. It contains several sub-archives of
interest to the neural network community:
- The Computing Research Repository (CoRR), opened in September 1998, is
sponsored by ACM. Papers in CoRR are classified in two ways: by subject
area (each subject has a moderator) and by using the ACM classification
scheme, which covers all of computer science. Subjects relevant to
computational intelligence include: artificial intelligence, computation
and language, computer vision and pattern recognition, human-computer
interaction, learning, multiagent systems, neural and evolutionary
computation and robotics. The address is
http://arxiv.org/archive/cs/intro.html
- The Nonlinear Sciences repository includes adaptation and
self-organizing systems, cellular automata and chaotic dynamics.
- The Physics repository includes disordered systems and neural networks,
data analysis, statistics, probability and Bayesian analysis.
- Other interesting repositories in this archive, like cellular/molecular
neuroscience, developmental and behavioral/systems neuroscience, are not
too popular yet, but are ready to receive more papers.
The e-Print Archive has many copies around the world and is a great
service to the physics and mathematics community. A similar initiative has
been started in cognitive sciences: the CogPrints archive
(http://cogprints.soton.ac.uk/) will store papers in psychology,
neuroscience, linguistics, artificial intelligence, robotics, vision,
learning, speech, neural networks, philosophy of mind and language,
behavioral ecology, sociobiology, behavior genetics, evolutionary theory,
psychiatry, neurology, human genetics, brain imaging, anthropology and
other social and mathematical sciences pertinent to the study of
cognition. The CogPrints archive received the Psychological Science Award
for "contribution to psychology on the Internet" from
PsychologicalScience.net.
Although no comparable repository designed specifically for the neural
network community seems to exist, several initiatives are worth
mentioning. They aim at indexing Web resources, searching for papers
placed in people's archives. Storing links to papers at different sites
has some
disadvantages: links are sometimes changed or papers removed by system
administrators, and it is hard to index them in a useful way. On the other
hand it is much easier to index the Web than to create a central
repository.
The Princeton NEC Research Institute team (Lee Giles, Steve Lawrence and
Kurt Bollacker) created CiteSeer, now called Research Index
(http://www.researchindex.com), a system for automatically creating
digital libraries, with emphasis on citation matching, indexing and
ranking. Links to the PostScript or PDF papers are shown in the search
results and cached versions of articles are available directly from the
system; the "correct" option retrieves the first page of the paper and
shows it in the browser. Since the last and the first names may not be
correctly ordered, authors should check their papers manually and
correct the database, especially if their names contain letters with
accents - the system does not deal correctly with French, Spanish or
Polish names! On demand, citations are shown in the context in which they
appeared. Research Index seems to be quite popular, since the system
reports high load even in the middle of the night (Princeton time). The
"Computer Science Directory" shows a list of papers in different subject
areas, ordered according to the number of citations, authority of their
authors and their tutorial value.
Research Index is certainly the best bibliographical system at the moment.
It does not give references to authors' home pages (it would be rather
hard to do this automatically, since authors don't give their web
pages in publications) but one can frequently guess them starting from the
link to the paper. The HP-search service (Trier University, Germany,
http://hpsearch.uni-trier.de/hp/) is the best chance to find personal home
pages of computer scientists. In April 2000 more than 42,000 entries were
stored there.
The Collection of Computer Science Bibliographies has more than one
million references, amounting to 660 MBytes of BibTeX entries! It is a
meta-service, composed of about 1200 specialized bibliographies and
updated monthly from their original locations. About 90,000 references
contain URLs to online versions of papers. There are more than 2000 links
to other sites carrying bibliographic information, including the large
Computer Science Bibliography at Trier University, which stores
bibliographical information from major conferences, books and journals
but contains relatively little information related to neural networks. The
address is http://liinwww.ira.uka.de/bibliography/index.html.
The ML Papers search engine (http://gubbio.cs.berkeley.edu/mlpapers/),
created in 1997 to index machine learning papers, was probably the
first system implementing automatic extraction of titles, authors and
abstracts from PostScript versions of papers. It has a simple interface
for searching and shows titles, authors, abstracts and links to PostScript
papers (this seems to be the only format indexed). Almost 1300 papers with
the "neural" keyword were found in April 2000. Adding new papers requires
only giving the URL of an FTP or HTTP archive.
An interesting approach to the problem of automatic indexing has been
taken by the Just Research company, creators of Cora
(http://www.cora.justresearch.com/), a special-purpose search engine
covering over 50,000 research papers found in about one hundred computer
science departments. Computer science has been divided into 10 categories;
the "Artificial Intelligence" category includes data mining and machine
learning, and in the "Machine Learning" section neural networks are found
among case-based learning, genetic algorithms, probabilistic methods,
reinforcement learning, rule learning and theory. There are 75 computer
science categories in this database. Cora allows searching for keywords
found in papers that are stored in the PostScript format. Results are used
to provide automatically generated BibTeX entries (sometimes with errors)
and include the title, authors and abstract of the paper, the address of
the main page where the paper has been found, alternative addresses where
it is stored, a list of references extracted from the paper and backward
references (papers referring to the current paper).
The process of adding new files is very simple, requiring just a
submission of the URL of the archive. The 50 highest-ranking papers are
automatically displayed in each category, with rank determined by analysis
of citations allowing for an automatic identification of papers as survey
articles, seminal articles, papers with the largest number of references
and papers written by authoritative authors. Reinforcement learning and
probabilistic techniques have been used for automatic classification. The
Cora project is led by Andrew McCallum, and although it is still in the
research phase it is quite useful.
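The simplest form of ranking by citation analysis can be sketched in a few
lines of Python. This is only an illustration under stated assumptions, not
the method used by Cora or Research Index: it counts how many distinct
papers cite each paper and keeps the top entries. The function name
rank_by_citations, the input layout and the toy paper identifiers are all
hypothetical.

```python
from collections import Counter


def rank_by_citations(references, top_n=50):
    """Rank papers by the number of distinct papers that cite them.

    references maps a paper id to the list of paper ids it cites
    (a hypothetical input format chosen for this sketch).
    """
    counts = Counter()
    for citing, cited_list in references.items():
        # Count each cited paper once per citing paper; skip self-citations.
        for cited in set(cited_list):
            if cited != citing:
                counts[cited] += 1
    # Sort by descending citation count, then by id for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Toy data with made-up paper ids, for illustration only.
toy_references = {
    "paper_a": ["paper_c", "paper_d"],
    "paper_b": ["paper_c"],
    "paper_c": ["paper_d"],
    "paper_d": [],
    "paper_e": ["paper_c", "paper_c", "paper_e"],
}
top_papers = rank_by_citations(toy_references, top_n=2)
```

A real system would also have to match differently formatted citations of
the same paper before counting, which is the hard part of the problem.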
The New Zealand Digital Library (NZDL) project (http://www.nzdl.org/) has
created publicly available search engines for domains from computer
science technical reports to music videos. The emphasis of this project is
on the creation of full-text searchable digital libraries; it does not use
machine learning technology to automate the creation of search engines.
The NZDL collection provides a huge index to over 45,000 computer science
reports: whole papers were analyzed (over 30 Gbytes of PostScript files,
1.3 million pages), so any fragment of text may be found. The collection
also contains almost 30,000 figures extracted from the reports.
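The idea behind full-text search can be illustrated with a small inverted
index in Python. This is a minimal sketch, assuming plain text has already
been extracted from the reports; it is not the NZDL implementation, it
matches whole words rather than arbitrary text fragments, and the names
build_index, search and the toy report ids are hypothetical.

```python
import re
from collections import defaultdict

WORD_PATTERN = re.compile(r"[a-z0-9]+")


def build_index(documents):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in WORD_PATTERN.findall(text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing all query words."""
    words = WORD_PATTERN.findall(query.lower())
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Toy collection with made-up report ids and texts, for illustration only.
toy_reports = {
    "tr-1": "Neural networks for speech recognition",
    "tr-2": "Recurrent neural networks",
    "tr-3": "Genetic algorithms and rule learning",
}
toy_index = build_index(toy_reports)
matches = search(toy_index, "neural networks")
```

Because every word of every report is indexed, a query can match text
anywhere in a paper, not only in its title or abstract.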
The Networked Computer Science Technical Reference Library (NCSTRL) aims
at creating a "premier international on-line library of computer science
technical reports". It provides software allowing institutions (over 160
at the moment, including leading US universities) to create digital
libraries which are then integrated into the NCSTRL services. A simpler
solution is also possible, with FTP archive servers, requiring
contribution of bibliographical information to the central NCSTRL index
(http://cs-tr.cs.cornell.edu/). Since the indexing is not automatic, the
project will probably stay more popular with libraries (it is run by the
Cornell University Digital Library Research Group) than with computer
scientists.
Experimental methods of searching for relevant information based on
self-organizing maps exist (for example WebSOM,
http://websom.hut.fi/websom/), but so far they are not useful for
large-scale searches.
Finally, the IEEE Bibliographies Online service (for IEEE members only)
provides bibliographical information on IEEE-sponsored conferences and
journals, but no links to the actual papers (these are available to
subscribers via the Opera services). IEEE bibliographies are at
http://www.biblio.ieee.org/scripts/biblio_home.html.
Links to all services mentioned here (and to many others) are stored at:
http://www.phys.uni.torun.pl/~duch/neural.html#biblio.