COMP9315 25T1: Assignment 1
Aims
This assignment aims to give you
• an understanding of how data is treated inside a DBMS
• practice in adding a new base type to PostgreSQL
The goal is to implement a new data type for PostgreSQL, complete with input/output functions, comparison operators and the ability to build indexes on values of the type.
Summary
Deadline: Friday 20:59:59 21st March (Sydney Time).
Pre-requisites: Before starting this assignment, it would be useful to complete Prac P04.
Late Penalty: 5% of the max assessment mark per-day reduction, for up to 5 days.
Marks: This assignment contributes 15 marks toward your total mark for this course.
Submission: Moodle > Assignment > Assignment 1.
Note: Make sure that you read this assignment specification carefully and completely before starting work on the assignment. Questions which indicate that you haven't done this will simply get the response "Please read the spec".
We use the following names in the discussion below:
• PG_CODE
|
... the directory where your PostgreSQL source code is located
|
|
(typically /localstorage/YOU/postgresql-15.11/)
|
• PG_HOME
|
... the directory where you have installed the PostgreSQL binaries
|
|
(typically /localstorage/YOU/pgsql/bin/)
|
• PG_DATA
|
... the directory where you have placed PostgreSQL's data
|
|
(typically /localstorage/YOU/pgsql/data/)
|
• PG_LOG
|
... the file where you send PostgreSQL's log output
|
|
(typically /localstorage/YOU/pgsql/data/log/)
|
|
Introduction
PostgreSQL has an extensibility model which, among other things, provides a well - defined process for adding new data types into a PostgreSQL server. This capability has led to the development by PostgreSQL users of a number of types (such as polygons) which have become part of the standard distribution. It also means that PostgreSQL is the database of choice in research projects which aim to push the boundaries of what kind of data a DBMS can manage.
In this assignment, we will be adding a new data type for dealing with PostAddress. You may implement the functions for the data type in any way you like provided that they satisfy the semantics given below (in the Post Address Data Type section).
Adding Data Types in PostgreSQL. The process for adding new base data types in PostgreSQL is described in the following sections of the PostgreSQL documentation:
• 38.13 User-defined Types
• 38.10 C-Language Functions
• 38.14 User-defined Operators
• SQL: CREATE TYPE
• SQL: CREATE OPERATOR
• SQL: CREATE OPERATOR CLASS
Section 38.13 uses an example of a complex number type, which you can use as a starting point for defining your PostAddress data type (see below). There are other examples of new data types under the directories:
• PG_CODE/contrib/chkpass/ ... an auto-encrypted password datatype
• PG_CODE/contrib/citext/ ... a case-insensitive character string datatype
• PG_CODE/contrib/seg/ ... a confidence-interval datatype
|
These may or may not give you some useful ideas on how to implement the PostAddress data type. For example, many of these data types are fixed-size, while PostAddress are variable-sized. A potentially useful example of implementing variable-sized types can be found in:
• PG_CODE/src/tutorial/funcs.c ... implementation of several data types
|
Setting Up
You ought to start this assignment with a fresh copy of PostgreSQL, without any changes that you might have made for the Prac exercises (unless these changes are trivial). Note that you only need to configure, compile and install your PostgreSQL server once for this assignment. All subsequent compilation takes place in the src/tutorial directory, and only requires modification of the files there.
Once you have re-installed your PostgreSQL server, you should run the following commands:
$ cd PG_CODE/src/tutorial
$ cp complex.c postadd.c
$ cp complex.source postadd.source
|
Once you've made the postadd.* files, you should also edit the Makefile in this directory and add the green text to the following lines:
MODULES = complex funcs postadd
DATA_built = advanced.sql basics.sql complex.sql funcs.sql syscat.sql postadd.sql
|
The rest of the work for this assignment involves editing only the postadd.c and postadd.source files. In order for the Makefile to work properly, you must use the identifier _OBJWD_ in the postadd.source file to refer to the directory holding the compiled library. You should never modify directly the postadd.sql file produced by the Makefile. Place all of your C code in the postadd.c file; do not create any other *.c files.
Note that your submitted versions of postadd.c and postadd.source should not contain any references to the complex type. Make sure that the documentation (comments in program) describes the code that you wrote.
Postal Address Data Type
We wish to define a new base type PostAddress to represent postal system. We also aim to define a useful set of operations on values of type PostAddress and wish to be able to create indexes on PostAddress attributes. How you represent PostAddress values internally, and how you implement the functions to manipulate them internally, is up to you. However, they must satisfy the requirements below.
Once implemented correctly, you should be able to use your PostgreSQL server to build the following kind of SQL applications:
create table Addresses (
id integer primary key,
address PostAddress not null, -- etc. etc.
);
insert into Addresses(id, address) values
(1,'U19/36 Queen Ave, Southgate, AR 7279'), (2,'U3/100 Victoria Ave, Lakeside, AU 5135');
|
Having defined a hash-based file structure, we would expect that the queries would make use of it. You can check this by adding the keyword EXPLAIN before the query, e.g.
db=# create index on Addresses using hash (address);
db=# explain analyze select * from Addresses where address='U19/36 Queen Ave, Southgate, AR 7279';
|
which should, once you have correctly implemented the data type and loaded sufficient data, show that an index-based scan of the data is being used.
Postal Address Values
Valid PostAddress values is defined as following:
• A PostAddress has 4 parts: DetailedUnitRoad, Suburb, State and Postcode
• The DetailedUnitRoad must include a street name and may include a unit number.
• The unit number includes one letter and digits.
• The street name includes digits and word list.
• The unit number and street name in DetailedUnitRoad are separated by '/'.
• The suburb must consist of one word.
• The state must be a two-letter uppercase abbreviation.
• The postcode must be a four-digit number.
• There will be no non-numeric postcodes.
A more precise definition can be given using a BNF grammar:
PostAddress ::= DetailedUnitRoad ', ' Suburb ', ' State ' ' Postcode
DetailedUnitRoad ::= Street | Unit '/' Street
Unit ::= Letter Digits
Street ::= Digits ' ' WordList
Suburb ::= WordList
State ::= Upper Upper
Postcode ::= Digit Digit Digit Digit
WordList ::= Word | Word ' ' WordList
Word ::= Letter | Letter Word
Letter ::= 'a' | 'b' | ... | 'z' | 'A' | 'B' | ... 'Z'
Upper ::= 'A' | 'B' | ... | 'Z'
Digit ::= '0' | '1' | ... | '9'
|
Under this syntax, the following are valid postal addresses:
U19/36 Queen Ave, Southgate, AR 7279
U3/100 Victoria Ave, Lakeside, AU 5135
B33/240 Emerald Forest HWY, Hillsborough, NO 5865 240 Emerald Forest HWY, Hillsborough, NO 5865
|
The following addresses are not valid in our system:
U19/ 36 Queen Ave, Southgate, AR 7279
|
# Having space before or after ‘/’
|
U19/Queen Ave, Southgate, AR 7279
|
# Missing street number
|
U19/36 Queen Ave, Southgate, A 7279
|
# Incorrect state abbreviation
|
240 Emerald Forest HWY, Hillsborough, N3 5865 U19/36 Queen Ave, Southgate, AR7279
# Missing space between state abbreviation and postcode
U19/36 Queen Ave, Southgate, AR 72A9 # Non-numeric postcode
19/36 Queen Ave, Southgate, AR 7279 # Unit format error
240 Emerald Forest HWY Hillsborough NO 5865
240 Emerald Forest HWY, Hillsborough, no 5865
|
Important: for this assignment, we define an ordering on addresses as follows:
• The ordering is determined initially by the state name.
• If the state names are equal, then the ordering is determined by the suburb name.
• If the suburb names are equal, then the ordering is determined by the street.
• If street names are equal, the ordering is determined by the unit number.
• Ordering of parts is determined lexically and case-insensitive.
Representing Postal Addresses
The first thing you need to do is to decide on an internal representation for your PostAddress data type. You should do this, however, after you have looked at the description of the operators below, since what they require may affect how you decide to structure your internal PostAddress values.
When you read strings representing PostAddress values, they are converted into your internal form, stored in the database in this form, and operations on PostAddress values are carried out using this data structure. It is useful to define a canonical form for addresses, which may be slightly different from the form in which they are read (e.g. "U19/36 Queen Ave, Southgate, AR 7279" might be rendered as " U19/36 Queen Ave,Southgate,ar 7279"). When you display PostAddress values, you should display them exactly in the same way they are input, regardless of how they are stored.
The first functions you need to write are ones to read and display values of type PostAddress. You should write analogues ofthe functions complex_in(), complex_out() that are defined in the file complex.c. Make sure that you use the V1 style. function interface (as is done in complex.c).
Note that the two input/output functions should be complementary, meaning that any string displayed by the output function must be able to be read using the input function. There is no requirement for you to retain the precise string that was used for input (e.g. you could store the PostAddress value internally in a different form such as splitting it into several parts). One thing that postadd_in() must do is determine whether the input string has the correct structure (according to the grammar above). Your postadd_out() should display each postal address in a format that can be read by postadd_in().
You are not required (but you can) define binary input/output functions, called receive_function and send_function in the PostgreSQL documentation, and called complex_send and complex_recv in the complex.c file. As noted above, you cannot assume anything about the input length of the postal addresses. Using a fixed-size representation for PostAddress limits your maximum possible mark to 10/15.