DIY corpora: the WWW and the translator
Federico Zanettin
Abstract
The WWW is the single largest existing
repository of electronic texts, and has recently attracted the attention of
researchers involved in translator training as a suitable source of texts for
the creation of "disposable corpora". These are small, specialized
corpora created ad-hoc to serve the needs of the translator for a specific
translation project, and their value lies not only in their analysis but even
more so in their creation. This
approach complements a number of studies which have been carried out on the use
of small corpora for language learning and translator training, where the main
focus is on methods and techniques for analysing texts already collected by the
teacher. This paper presents an experiment which was carried out at the School
for Translators and Interpreters of the University of Bologna in Forlì with
third- and fourth-year translation students in the context of a course on
computer-assisted translation tools. Students were given a text to translate and
asked to search the Internet, select suitable web pages in the target language,
and save them to disk. In this way, alternating between translating and adding
material to the corpus as the translation proceeded, they were able to
familiarize themselves with the topic of the translation at hand, to select
texts according to text type, to assess the reliability of text sources and
evaluate the prospective readership. These DIY corpora were then browsed by
switching between a full-text mode and a concordancing mode, and learners were
able to tackle many translation problems related to specific terminology and
phraseology.
1 Introduction
Traditionally, translators have used
"parallel texts", i.e. collections of printed texts produced in
similar communicative situations, as a way of checking text-typological
conventions in the source and target languages. In the last few years
information technology has brought about a completely new scenario. The
availability of vast quantities of texts in many languages and on all kinds of
subjects is a dream come true for translators as well as for all types of
discourse professionals, text processors and language service providers.
The WWW is the largest existing
repository of texts. The number of publicly accessible web pages has reached 2
to 4 billion (Fletcher 2001). The vast majority of these are in English (about
80% according to estimates), but the number of users whose first language is not
English is increasing (Fletcher 2001), and pages in languages other than English
are increasing at a faster pace than pages in English. Table 1 gives an estimate
of the growth of the main European languages on the web between October 1996 and
February 2000, sorted by growth factor (shown in the last column).
| Language | Millions of words (October 1996) | Millions of words (February 2000) | Growth factor |
|---|---|---|---|
| Spanish | 104 | 1,894 | 18 |
| German | 229 | 3,333 | 15 |
| French | 223 | 2,732 | 12 |
| Italian | 124 | 1,338 | 11 |
| Portuguese | 106 | 1,161 | 11 |
| Norwegian | 106 | 947 | 9 |
| Finnish | 21 | 166 | 8 |
| English | 6,082 | 48,064 | 8 |
Table 1: Growth of languages on the
Web (data from Grefenstette and Nioche, 2000)
For instance, while the number of
words in Italian available on the web in February 2000 was eleven times larger
than that of October 1996, the number of words in English increased "only"
eightfold.
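The growth factors in Table 1 can be recomputed from the two word-count columns. The following is a minimal Python sketch of that arithmetic, assuming the factor is simply the February 2000 count divided by the October 1996 count, rounded to the nearest integer; the names `WORDS_MILLIONS` and `growth_factor` are hypothetical and are not taken from Grefenstette and Nioche (2000).

```python
# Word counts in millions, copied from Table 1 (Grefenstette and Nioche 2000):
# language -> (October 1996, February 2000).
WORDS_MILLIONS = {
    "Spanish": (104, 1894),
    "German": (229, 3333),
    "French": (223, 2732),
    "Italian": (124, 1338),
    "Portuguese": (106, 1161),
    "Norwegian": (106, 947),
    "Finnish": (21, 166),
    "English": (6082, 48064),
}


def growth_factor(before, after):
    """Return how many times larger `after` is than `before`, rounded.

    Assumption: the published factor is a plain ratio rounded to an integer.
    """
    return round(after / before)


growth = {lang: growth_factor(*counts) for lang, counts in WORDS_MILLIONS.items()}

# List the languages from fastest to slowest growth.
for lang, factor in sorted(growth.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang:<11}{factor:>3}")
```

Under this assumption the computed values agree with the last column of the table, for example 11 for Italian and 8 for English.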
With availability comes ease of
access: A few clicks of the mouse are often worth several trips to the library
or consultations with clients and colleagues. There is also the added bonus
that electronic texts can be analysed through corpus linguistics techniques
rather than just read sequentially, thereby uncovering linguistic information
which would be otherwise very difficult to obtain.
Recent research in translation
studies has stressed the contribution which corpora of electronic texts can
bring to translators. By using appropriate software, translators can look up
words in a matter of seconds, and highlight patterns by sorting contexts around
search words. If a corpus is appropriately designed, it can provide reliable
evidence of authentic linguistic behaviour and text-structuring conventions by
highlighting recurrent patterns. Terminological and collocational information
can be especially useful.
Experiments in the literature have
reported the uses of bilingual corpora and monolingual corpora (Zanettin 1998,
2001; Bowker 1998, 2000; Pearson 1998; Gavioli
and Zanettin 2000) as sources to compile term banks, and as aids during
the translation task. One problem with these typically small and domain
specific corpora is the limited range of topics and text types for which they
are available. Recent work has concentrated on increasing this range and
availability, and a number of well crafted bi- or multilingual corpora,
comparable and/or parallel, such as the English Norwegian Parallel Corpus,
the COMPARA
corpus and the CEXI corpus are
complete or under way. Some work has also been done using large monolingual
corpora such as the BNC (Stewart 2000,
Bernardini 2001). These resources will
certainly prove to be very useful. However, even the 100-million-word BNC is
ill-equipped to meet the needs of translators working with very specialised
texts and confronted with specific terminology.
For these needs, translators can now
turn to the web.
There are two sets of problems
related to the use of WWW documents as corpus material. The first concerns
procedures for assessing relevance and reliability: Information is dispersed in
the WWW through vast quantities of documents, and it is thus crucial for the
translator to retrieve this information in the most efficient and effective
way.
The second relates to strategies and techniques for searching electronic
texts: Search engines provide access points to Internet documents either
through lists generated by full text searches or by pre-selected lists
organized by topic, and are thus catalogues rather than corpora.
The WWW is not a corpus, but it can be used as a corpus. Some search engines
(e.g. Google, Copernic) display, next to document
pointers (hyperlinks), a concordance-like context with the search word(s)
highlighted. Some applications available on the web build on search engines and
are designed more specifically for the needs of language professionals.
WebCorp (Kilgarriff 2001) and
KWICFinder (Fletcher 2001), for
instance, download web pages which match the user's search criteria and produce
keyword-in-context abstracts of them. KWICFinder
permits more targeted searches by using a number of restricting criteria and
allows for the display of the output in a number of formats. These enhancements
go a long way towards solving the problem of using the web as a suitable source
of text-linguistic information, but they still do not solve the problems of the
relevance and the reliability of the document abstracts retrieved. The Internet
contains a large number of ephemeral texts of dubious authorship and
authority, and the relevance criteria of search engines are different from
those of translators of specific texts (Fletcher 2001).
In this paper I would like to take
another approach, which has already been explored by a number of researchers
and trainers (e.g. Pearson 1998, Maia 2000, Varantola 2000, Bertaccini and
Aston 2001), and look at the web as a source of texts for a DIY corpus.
A DIY web corpus can be
characterized as follows:
- it is a collection of Internet documents, or more precisely of web pages in HTML;
- it is created ad hoc as a response to a specific text to be translated;
- it is an open corpus: more material can be added as the need arises;
- it is disposable (Varantola 2000) or virtual (Ahmad et al. 1994): it is not destined to be part of a more permanent corpus, and can be disposed of as soon as the translation is completed. Copyright permissions are not required;
- like "parallel texts", it can be either bilingual comparable or target monolingual.
In the following sections I report
on an experiment with DIY corpora at the School for Translators and
Interpreters of the University of Bologna at Forlì with a group of advanced
translation students.
2 The experiment
This experiment was carried out
within a course on CAT tools, which comprised a number of different modules and
was designed as a general introduction to computer assisted translation,
providing an overview of existing technologies available to professional
translators such as terminology management systems and translation memory
tools, and of resources such as online term banks, machine translation
programmes, mailing lists for translators, etc. One module, of which this
experiment was a part, was on "corpus management in translation",
defined by Varantola (forthcoming) as "the knowledge and skills needed in
the compilation and use of corpus information for individual translation
assignments".
At the time of the experiment, which
took place over two weeks in five two-hour sessions, the students had already
been exposed to the use of various professional tools and online resources, and
while not all of them were skilled Internet users, only one of them was a
novice. They were also already familiar with the main features of WordSmith Tools (Scott 1996), the
corpus analysis programme used in the experiment.
In preparation for the task students
were given a brief survey of the tricks and treats of Internet Explorer and WordSmith Tools, and a task sheet was
distributed with operational instructions and a set of quick reference
guidelines. As regards the browser, for instance, students were instructed to
take advantage of the "chronology" feature, which lists all visited
documents in a side window together with their title and address and allows for
off-line browsing. They were told to open documents in multiple windows and save
all relevant web pages in individual "corpus folders" on the hard
disk of their computers, in order to use them as a corpus to be analysed with
the concordancer. They were also given a list of some search engines (some
international, e.g. Google, Altavista, Yahoo,
and some Italian, e.g. Virgilio, Arianna, Kataweb).
Those who translated into/from
German, French or Spanish found language/area specific search engines.
Students were asked to engage in a
real translation task, using a DIY corpus as a resource to help them translate
either a text which they had as an assignment from another course or - for some
of them - as a paid translation job. No restrictions were given as to source
and target languages. Two additional texts (one in English, one in Italian)
were provided for those who didn't have a better text handy.
The following are some of the texts
which students translated:
- an encyclopaedia article on prostate cancer, from English into Italian
- (part of) a textiles catalogue (web page), from Italian into Spanish
- (part of) a bicycle locks catalogue (web page), from Italian into English
- an article on earthquakes from a science magazine, from Italian into English
- a promotional leaflet on diamonds, from English into Italian
Students were encouraged to
translate their text using a number of tools they were already familiar with,
such as online terminology banks (e.g. Eurodicautom, Logos), translator workbenches (e.g. Trados, Déjà Vu,
WordFisher), electronic dictionaries
(e.g. Babylon), etc. After setting up their workstation
(opening the text editor/workbench, Internet browser and WordSmith Tools), they read the source
text and started their translation. They also began to search the Web for
suitable texts to include in their corpus and help them in the specific
translation task. Some students chose to work alone, others worked in pairs or
groups of three. The teacher acted as a facilitator, helping to solve problems.
At the end of the experiment the students were asked to write a brief report on
the benefits and shortcomings of creating and using a DIY web corpus as a
translation resource.
A first step usually consisted in
trying to get a better understanding of the source text. To this end, some
students focused on unknown terminology, using paper and electronic
dictionaries, term banks and other online resources. Online glossaries, usually
found by searching for the word "glossary" along with words
identifying the topic (e.g. "diamonds" or "prostate cancer")
were reported to be especially useful. By checking for equivalents in source
and target language glossaries students were able to identify some key terms to
be used as key words for searching for relevant corpus candidates. Other
students started by looking at "Internet directories", i.e. lists of
web sites organized into categories which are provided by "portals".
A combination of the two techniques with the use of multiple search engines
seemed the most useful strategy.
Students were free to choose the
type and number of texts to download from the Internet and include in their
corpus. They were advised to open all candidate documents, scanning them for
content rather than for specific linguistic items, and save relevant pages in
their corpus folder. After having found and saved a first group of texts,
students used WordSmith Tools to analyse the corpus while proceeding with the
translation. As the translation proceeded they added more material, refining
searches according to their needs. The size of their corpora eventually varied
between 10 and 50 texts, or 5,000 to 40,000 words.
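Figures such as the number of texts and words in a corpus folder can be obtained with a short script. The following is a minimal Python sketch, not part of the original experiment (which relied on WordSmith Tools). It assumes that the pages were saved as UTF-8 `.htm` or `.html` files in a single folder, strips the markup with the standard library's `html.parser`, and uses a simplified definition of a word. The names `TextExtractor`, `html_to_text`, `count_words` and `corpus_size` are hypothetical.

```python
import re
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, skipping scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a <script> or <style> element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text content of `html` with tags removed and spaces normalised."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def count_words(text):
    """Count words; hyphenated forms such as "U-lock" count as one word."""
    return len(re.findall(r"\w+(?:['-]\w+)*", text))


def corpus_size(folder):
    """Return (number of texts, total words) for the saved pages in `folder`."""
    pages = sorted(Path(folder).glob("*.htm*"))
    words = sum(
        count_words(html_to_text(page.read_text(encoding="utf-8", errors="replace")))
        for page in pages
    )
    return len(pages), words
```

Calling `corpus_size` on a student's corpus folder would return a pair such as the text and word counts reported above; pages saved in other encodings would need the `encoding` argument adjusted.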
The relevance of a document was
usually decided after scanning the text for both verbal and visual clues, such
as titles, layout and images. Decisions were taken at the level of overall
content, text type, style and register. For instance, the student translating a
catalogue of luxury textiles discarded a number of texts from the web site of a
museum dealing with African traditional textiles and clothing; the student translating a bicycle locks catalogue
discarded a newspaper article on urban safety.
As for reliability, students
discarded bad translations (which they sometimes came across when using tentative
translation equivalents as search terms) and privileged texts produced by
recognizable entities and authorities within the relevant discourse communities
(experts, producers, public and private agencies and associations).
Finding useful web pages in the
target language was also an exercise in audience design, giving students a
chance to form an idea of the prospective readership of their
translations.
Having created their corpus, they
still had the problem of how to find the information they were looking for, but
they could use the knowledge derived from their prior acquaintance with the texts
to conduct searches around known equivalents or come up with informed
hypotheses.
The corpus was mainly used for
finding information on terminology, phraseology and collocations. For instance,
after having established while constructing the corpus that an "antifurto
per biciclette" is a "bicycle lock", it was easy for a student
to get a list of different types of locks (cable lock, coiling cable lock,
U-lock, disc lock, etc.) by sorting the output of a concordance for "lock"
according to the first word to the left. The student translating a scientific
report on earthquakes learned that while in Italian both walls and buildings
"crollano", in English buildings usually "collapse" while
the word "wall/s" collocates more frequently with "fall/s".
When looking for a translation for "cedimenti strutturali gravi" one
student generated a concordance of "structural*" and quickly found
that she could use the phrase "heavy structural damage".
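The two concordancing operations described above, sorting the hits for "lock" by the first word to the left and searching with a wildcard such as "structural*", can be sketched in a few lines of Python. This is only an illustration of the technique, not a description of how WordSmith Tools works internally. The function names are hypothetical, the corpus is assumed to be available as plain text, and the sample sentences are invented.

```python
import re


def concordance(text, pattern, width=30):
    """Return (left, keyword, right) context triples for each match of `pattern`.

    A trailing * in `pattern` matches any word ending, as in "structural*".
    """
    if pattern.endswith("*"):
        regex = r"\b" + re.escape(pattern[:-1]) + r"\w*"
    else:
        regex = r"\b" + re.escape(pattern) + r"\b"
    hits = []
    for match in re.finditer(regex, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(), right))
    return hits


def first_word_left(hit):
    """Sort key: the word immediately to the left of the keyword, lowercased."""
    words = re.findall(r"[\w-]+", hit[0])
    return words[-1].strip("-").lower() if words else ""


def sorted_by_left(hits):
    """Sort concordance lines alphabetically by the first word to the left."""
    return sorted(hits, key=first_word_left)


# Invented sample sentences standing in for a corpus of bicycle lock pages.
SAMPLE = (
    "The cable lock is light. Fit the disc lock to the wheel. "
    "A coiling cable lock stretches further. A U-lock resists leverage."
)

for left, keyword, right in sorted_by_left(concordance(SAMPLE, "lock")):
    print(f"{left:>30} [{keyword}] {right}")
```

Sorting on the left context groups the modifiers together, so that the different types of lock appear as adjacent concordance lines.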
When they were uncertain or
presented with multiple translation candidates, students relied on frequency of
occurrence as an indicator of reliability, stressing that since the texts in
the corpus were carefully selected, it was unlikely that they would produce
spurious examples.
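Comparing translation candidates by frequency of occurrence amounts to counting how often each candidate phrase appears in the corpus. The following minimal Python sketch illustrates the idea; the function name `candidate_frequencies`, the candidate phrases and the sample text are all invented for illustration, and the corpus is assumed to be available as a single plain-text string.

```python
import re
from collections import Counter


def candidate_frequencies(text, candidates):
    """Count whole-phrase, case-insensitive occurrences of each candidate."""
    counts = Counter()
    for phrase in candidates:
        words = [re.escape(word) for word in phrase.split()]
        # Allow any run of whitespace (including line breaks) between words.
        regex = r"\b" + r"\s+".join(words) + r"\b"
        counts[phrase] = len(re.findall(regex, text, flags=re.IGNORECASE))
    return counts


# Invented sample text standing in for a corpus of earthquake reports.
SAMPLE = (
    "Heavy structural damage was reported in the old town. "
    "Engineers found heavy structural damage to two bridges "
    "and severe structural damage to one school."
)
CANDIDATES = [
    "heavy structural damage",
    "severe structural damage",
    "serious structural damage",
]

for phrase, count in candidate_frequencies(SAMPLE, CANDIDATES).most_common():
    print(f"{count:>3}  {phrase}")
```

As the students stressed, raw counts of this kind are only meaningful when the texts in the corpus have been carefully selected.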
Some students resorted to
concordancing mainly while revising the translation. That is, they first wrote
a draft of the translation while finding parallel Internet texts, then went
through their text checking their hypotheses and intuition against the corpus
with WordSmith Tools.
3 Benefits and problems
In their reports, many students
noted the advantage of a corpus of electronic texts over more traditional
reference material. While in paper dictionaries the information is usually
buried in dense, small-type columns, web pages often contain images and other
multimedia features which aid understanding. They stated that constructing the
corpus was as useful as generating concordances from it, and that they often
went back to view a web page in full after looking at concordance lines.
However, in line with similar
observations made by both trainee and professional translators who participated
in similar experiments (Varantola forthcoming, Jääskeläinen and Mauranen 2000),
students complained about the lack of user-friendliness of WordSmith Tools.
Despite its many capabilities, the current version of WordSmith Tools is still
not fully equipped to work with tagged texts written in HTML/XML and, while it
is possible to exclude tags from view, it is not possible to jump from a
concordance line to the corresponding web page. To take advantage of important
information from layout and images, students thus had to switch between a
concordance in WordSmith Tools and the corresponding web page, with each file
having to be located by its name and opened in the browser.
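One way to reduce this switching is to keep the name of the source file with every concordance line, so that the page behind a line can be reopened directly. The following Python sketch illustrates the idea only; it does not describe any feature of WordSmith Tools. The function names are hypothetical, and it assumes that the plain text of each saved page is already available in a dictionary keyed by file name.

```python
import re
import webbrowser
from pathlib import Path


def concordance_with_sources(texts, word, width=30):
    """Return (file name, concordance line) pairs for `word` across `texts`.

    `texts` maps a file name to the plain text extracted from that page.
    """
    regex = re.compile(r"\b" + re.escape(word) + r"\b", flags=re.IGNORECASE)
    lines = []
    for name, text in texts.items():
        for match in regex.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            lines.append((name, f"{left:>{width}}[{match.group()}]{right}"))
    return lines


def open_source_page(folder, name):
    """Open the saved page behind a concordance line in the default browser."""
    return webbrowser.open(Path(folder, name).resolve().as_uri())
```

With the pairs returned by `concordance_with_sources`, a call such as `open_source_page("corpus", name)` would display the full page, layout and images included, for any line of interest; the folder name `corpus` is an assumption.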
One group of students spent much
time inspecting web pages for specific terminology rather than looking for
suitable texts. In this respect, they used the web itself as a corpus rather
than as a source for creating a corpus. They spent more time reading individual
documents and looking for exact equivalents than deciding whether a text
was likely to contain useful terminology and saving it for later inspection with
the concordancer. They also felt frustrated when they later found that, having
saved too few documents, their corpus was not large enough to be of use.
Students noted that searching for
web pages, creating the corpus and analysing it with the concordancer was
time-consuming, and argued that the translation task would have taken less time
if done with dictionaries alone. But they also stated that they felt more
confident about the solutions adopted, especially in translating into the
foreign language, and that the balance between costs and benefits would be
different with longer translation assignments.
Other observations by students
concerned the use of the web as a corpus resource: while some believed it would
be mostly useful when translating into a foreign language, others chose to
translate into their first language. While all students created a target
language corpus, not all of them created a source language one. [1]
4 Conclusions
DIY corpora are one of a number of
different types of corpus resources which translators can use in their work. [2]
Research on corpus use in translator
training environments generally takes a bottom-up approach, which could be
termed "from words to texts". This approach is mostly concerned with
finding appropriate ways of analysing corpus resources provided by the teacher,
be they large monolingual corpora or smaller mono- or bilingual corpora,
created ad hoc either from electronic text archives on CD-ROM or from printed
sources.
This approach has been complemented
by a top-down approach, i.e. one going "from texts to words", which
assumes no pre-existing corpus to be analysed, and which has been made possible
thanks to the availability of large quantities of texts on the Internet. By
compiling their DIY corpus prior to, during or even after the translation task
(Aston 2000), students (and translators) can get a first acquaintance with
texts, and take full advantage of web pages prior to word-prompted
analysis.
Hopefully software producers and
developers will create professional applications in which the functions of
browser and concordancer will be better integrated, and DIY corpora will find their
place in the translator's workstation together with other corpus resources and
computer assisted tools.
Notes
[1] An assessment of the
quality of the assignments was outside the scope of the experiment. However,
better translations seemed to have been produced by those students who adopted
successful strategies in the creation and analysis of the corpus.
[2] Translation memories are another type of corpus
resource: they are a very specialized kind of parallel corpus,
and are usually relevant, reliable and well integrated into the translation
work-flow. But of course translators do not have a translation memory ready for
all occasions.
References