Here is the outline for the course.
EGLUG Systems Development Course (ESDC)
Amr Ali
062810
Copyright Notice
Copyright (C) 2010 by Amr Ali
amr-ali.co.cc
All rights reserved.
This work is licensed under the Creative Commons
Attribution-Noncommercial-Share Alike 3.0 Unported License.
To view a copy of this license visit
or send a letter to Creative Commons, 171 Second Street,
Suite 300, San Francisco, California, 94105, USA.
Abstract
This course is only a primer on the system development field; it
should not be treated as a reference, nor should any serious study be
based on it. It only teaches the principles of system design and
development to developers who have invested quite an effort into
programming and are willing to go further into the realms and valleys
of the hacker.
Quotes
"Software Engineering might be science; but that's not what I do.
I'm a hacker, not an engineer." - Jamie Zawinski
"I decry the current tendency to seek patents on algorithms.
There are better ways to earn a living than to prevent others
from making use of one's contributions to computer science."
- Donald E. Knuth
"Science is what we understand well enough to explain to a computer.
Art is everything else we do." - Donald E. Knuth
"Always program as if the person who will be maintaining your program
is a violent psychopath that knows where you live." - Martin Golding
"Coding styles are like assholes, everyone has one and
no one likes anyone else's." - Eric Warmenhoven
Jokes
"Computers are like air conditioners: they stop working when you open
WINDOWS."
"unzip; strip; touch; finger; mount; fsck; more; yes; unmount; sleep"
- My daily UNIX command list.
"The Internet: where men are men, women are men,
and children are FBI agents."
"One of the main causes of the fall of the Roman Empire was that,
lacking zero, they had no way to indicate successful termination
of their C programs." - Robert Firth
"UNIX is user-friendly. It's just very selective about
who its friends are."
"Microsoft is not the answer,
Microsoft is the question, NO is the answer." - Erik Naggum
"I would love to change the world, but they won't give me
the source code" - Amr Ali? :-)
My Personal Favorite
"There are 10 kinds of people in this world, those that understand
trinary, those that don't, and those that confuse it with binary."
- Whoever understands that joke will do well in this course :-)
Table Of Contents
1. Introduction
1.1. Summary
1.2. Prerequisites
1.2.1. Programming Experience
1.2.2. Field Of Experience
1.2.3. Programming Languages
1.2.4. Depth Of System Knowledge
1.2.5. CPU Designs And Architectures
1.2.6. Mentality
2. UNIX Based Systems Communications
2.1. User Space To User Space
2.2. User Space To Kernel Space
2.3. Kernel Space To Kernel Space
2.4. InterProcess Communication
2.4.1. Types Of Communication
2.4.1.1. Signals
2.4.1.2. Pipes
2.4.1.3. Sockets
2.4.1.4. Message Queues
2.4.1.5. Semaphores
2.4.1.6. SpinLocks
2.4.1.7. Mutexes
2.4.1.8. Shared Memory
2.4.2. Synchronization
2.4.3. Common Problems
3. System Application Design
3.1. Strategy
3.2. Daemons
3.3. Logs
3.4. Storage
3.5. Debugging
4. Case Studies
4.1. Case 1
4.2. Case 2
4.3. Case 3
4.4. Final Project
5. Author
5.1. Background
5.2. Contact Information
6. Thanks
1. Introduction
1.1. Summary
ESDC is for whoever wants to develop system level applications and
solutions that are specifically designed for UNIX based systems.
This is an open ended course, which means the above TOC may be
expanded at any time without prior notice to students.
1.2. Prerequisites
1.2.1. Programming Experience
Whoever applies to this course should have had quite some experience
with different programming languages, to have the mature mentality
required for this course. Basically it is a "MUST" that a student has
written at least one thousand (1000) lines of code in any language.
1.2.2. Field Of Experience
It is not required that students have done any system development
prior to this course, but it is preferred that they have at least
read about the topic or have a general idea what the hell we are
talking about.
1.2.3. Programming Languages
It is a "MUST" that a student is very fluent in C as a language
but not necessarily in its libraries; however, general knowledge
of them is preferred, and knowing how to use the man pages is
a must along with knowing the meaning of "RTFM".
ASM is not required at all, but knowing the different syntaxes
is preferred, and maybe a little about how ASM works as a language.
BASH: I know I'm stating the obvious, but I don't want to be
confronted with questions about administering UNIX or how to build
a Makefile either, so BASH/M4 are best to be known.
1.2.4. Depth Of System Knowledge
This course is built around UNIX based systems, mainly Linux,
so it is a "MUST" to know your way around that system, and no,
this is not a "how to make windows drivers" course.
1.2.5. CPU Designs And Architectures
It is strongly preferred that a student knows about the different
CPU architectures and designs and how they affect system
development; that's why some ASM would be preferred in general.
Just know this: SMP (symmetric multiprocessing) is trouble. Lock your
code well, or keep debugging day and night until you bleed out of your
eye sockets.
1.2.6. Mentality
You should have the mentality of a hunter: you never quit, and
you never surrender to failure, you keep trying and trying.
Always be alert to the tiniest details and ready to adopt new
tactics and techniques quickly, which requires dedication
and effort.
2. UNIX Based Systems Communications
2.1. User Space To User Space
First let me introduce you to what user space is: user space is
where whatever is made by mere humans runs, from your shell and
editor to background processes (daemons). Code in user space
cannot communicate directly with the physical layer of your
computer; it has to ask the kernel.
User space communication mainly deals with the area
of IPC (InterProcess Communication), like shared memory
segments, message queues, and named/unnamed pipes.
Forget the ideas you had about having a certain file that
some process writes to and another reads from. I had these
ideas when I was 12; you are an adult, use IPC.
The main purpose of IPC is to let two processes
run totally independently of each other and still pass
information back and forth in an efficient way.
2.2. User Space To Kernel Space
"Kernel: is the central component of most computer operating
systems; it is a bridge between applications and the actual
data processing done at the hardware level." - Wikipedia
So let me put this in simpler words: the kernel is basically
the guy that facilitates the usage of your hardware. Bluntly, if you
had no kernel and wanted to print "hello." to your screen, you would
have to write the code that talks to your PCI bus and to your video
card, with all the pedantic op codes and flags necessary to make this
6 character word appear on your dead cold black screen.
But if we have our kernel, how do we communicate with it? Can we just
include a couple of C header files that hide all the above mentioned
code? The answer is of course YES. That's the main purpose of the
standard C library (`libc'), which is simply an abstraction over all
the assembly (yes ASM, the instruction that traps into the kernel
cannot be written in plain C) required to communicate back and forth
with your virtual/real hardware.
However, what if we still want to communicate with our kernel more
directly; aren't there any ways other than ASM? Of course there are,
you silly sally, and they are yet another abstraction over the ASM
interfaces the kernel provides: System Calls (SysCalls), IOCTL (itself
a system call), and ProcFS and NetLink (both as found on Linux). All
these are sometimes lumped under IPC, but I don't like calling them
that, so I'll call them ISC (InterSpace Communication).
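To make the libc-over-syscall layering a bit more concrete, here is a
minimal sketch, assuming Linux and glibc. It asks the kernel for the
process id twice: once through the generic `syscall' helper with the
raw call number, and once through the usual wrapper. The function
names raw_getpid and wrapped_getpid are made up for this example.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our PID through the raw system call interface.
 * syscall(2) is the libc helper that loads the registers and traps
 * into the kernel for us; SYS_getpid is the call number on Linux. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* The same request through the ordinary libc wrapper. */
long wrapped_getpid(void)
{
    return (long)getpid();
}
```

Both functions should return the same number; the wrapper is simply
the more portable and readable way to get there.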
2.3. Kernel Space To Kernel Space
Let's imagine that you got so elite that you created two
kernel modules, and you would like to exchange information between
them. The thing you must understand about the Linux kernel is that
it is monolithic: the core image and every loaded module run in one
shared address space. When you export a function or variable (with
`EXPORT_SYMBOL') it is added to what is known as the KST (Kernel
Symbol Table); this table contains everything that has been exported,
so other parts of the kernel can call or reference it.
As a side note, as I want to impress the kernel space image upon you,
floating point arithmetic is generally off limits in kernel space,
because saving and restoring the FPU state on every entry to the
kernel is not worth the cost. That, and how pedantic the process of
getting code into the main kernel repo is, should tell you not to
expect the functions/libraries you are used to in user space to exist
in kernel space.
2.4. InterProcess Communication
2.4.1. Types Of Communication
There are mainly 8 types of communication, three of which are locking
mechanisms: Signals, Pipes, Sockets, Message Queues, Semaphores,
SpinLocks, Mutexes, and Shared Memory. Each will be described
in the following sub sections, but the general idea is that they
are "not" used equally; they all have their different
purposes.
2.4.1.1. Signals
These are one of the oldest methods to build interrupt based
applications, which means that a signal can be sent to a process
or two in case of an event, or on the arrival of certain new data.
The interrupt based style is heavily used inside the kernel, so you
should understand this type of communication in great depth if
you are planning to dive into kernel space. Also note that
this type of communication is asynchronous, which means a signal can
arrive at any point while the receiver is running, and one way:
the signal carries little more than its number and there is no reply
channel, or it lives on one wire as our friends at the electrical
engineering department would love to call it.
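As a rough illustration of the interrupt style described above, the
sketch below installs a handler for SIGUSR1 with `sigaction' and does
nothing in it except set a flag. The names on_usr1,
install_usr1_handler and usr1_seen are hypothetical, and picking
SIGUSR1 is just an assumption for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set from the handler. volatile sig_atomic_t is a type that is safe
 * to write from a signal handler and read from normal code. */
static volatile sig_atomic_t got_usr1 = 0;

/* The handler runs whenever the signal is delivered, in the middle of
 * whatever the process was doing, so keep it tiny. */
static void on_usr1(int signo)
{
    (void)signo;
    got_usr1 = 1;
}

/* Install the handler for SIGUSR1. Returns 0 on success, -1 on error. */
int install_usr1_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Non-zero once at least one SIGUSR1 has been handled. */
int usr1_seen(void)
{
    return got_usr1;
}
```

Another process would trigger it with `kill -USR1 <pid>'; inside the
same process `raise(SIGUSR1)' does the job. The main loop then polls
usr1_seen() at a convenient point instead of doing real work inside
the handler.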
2.4.1.2. Pipes
This is a unidirectional byte stream way of communication, which
typically connects the standard output of one process to the standard
input of another process. Of course that bridge is made using
files, but not just normal usual files; they are files that live on
a VFS inode which itself points to a physical page within memory.
Note that these are unnamed pipes. There are also named pipes,
which create a real entry in the file system so that unrelated
processes can open them; the catch is that coordinating who reads
and who writes, and when, must be handled by you (locking, etc.), so
don't expect the magic of unnamed pipes. In return you can set
permissions on FIFOs (they are called that because they work on the
principle of First In First Out, so what you write first will be read
first on the other end). They can be created simply with the command
`mkfifo' (RTFM).
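The unnamed pipe case can be sketched as follows: the parent creates
the pipe, forks, the child writes into one end and the parent reads
from the other. This is only a minimal example; the function name
pipe_roundtrip is made up, and a read error is simply treated as end
of data.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that writes msg into an unnamed pipe; the parent reads
 * it back into buf (at most cap-1 bytes, NUL terminated).
 * Returns the number of bytes read, or -1 on error. */
ssize_t pipe_roundtrip(const char *msg, char *buf, size_t cap)
{
    int fds[2];                 /* fds[0] = read end, fds[1] = write end */
    if (cap == 0 || pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {             /* child: the writer */
        size_t len = strlen(msg);
        close(fds[0]);
        ssize_t w = write(fds[1], msg, len);
        close(fds[1]);
        _exit(w == (ssize_t)len ? 0 : 1);
    }

    /* parent: the reader. It must close its copy of the write end,
     * otherwise read() never sees end of file. */
    close(fds[1]);
    size_t total = 0;
    ssize_t n;
    while (total < cap - 1 &&
           (n = read(fds[0], buf + total, cap - 1 - total)) > 0)
        total += (size_t)n;
    buf[total] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}
```

Note that the data only flows one way here. For traffic in both
directions you would need a second pipe, or a socket.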
2.4.1.3. Sockets
I'm sure most of you people have stumbled upon that concept before,
read about it somewhere, or even used it. It's simply what makes up
today's networks and even the Internet; they all operate on the
concept of sockets. Figured it out already? Well, they are simply
all ports and IP/host addresses/names, but they are not necessarily
used only for communication over the network; they can also be used
as IPC. Heard of UNIX sockets before? And yes, they are different from
TCP/IP sockets. I'm not going into much detail over this here, as there
are literally tons of information about it on the Internet, which you
sir/ma'am can google on your own.
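For the IPC flavour of sockets, here is a small sketch using
`socketpair' with AF_UNIX: the parent sends a short message and a
forked child echoes it back over the same socket, something a pipe
cannot do. The name unix_socket_echo and the ECHO_MAX limit are
assumptions made for this example only.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define ECHO_MAX 256   /* longest message this sketch handles */

/* Send msg to a forked child over a UNIX socket pair; the child echoes
 * it back. Returns the number of bytes received, or -1 on error. */
ssize_t unix_socket_echo(const char *msg, char *reply, size_t cap)
{
    int sv[2];
    size_t len = strlen(msg);
    if (len == 0 || len > ECHO_MAX || cap <= len)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {             /* child: tiny echo server on sv[1] */
        char tmp[ECHO_MAX];
        close(sv[0]);
        ssize_t n = recv(sv[1], tmp, sizeof tmp, 0);
        if (n > 0)
            send(sv[1], tmp, (size_t)n, 0);
        close(sv[1]);
        _exit(0);
    }

    close(sv[1]);               /* parent: client on sv[0] */
    ssize_t got = -1;
    if (send(sv[0], msg, len, 0) == (ssize_t)len) {
        got = recv(sv[0], reply, cap - 1, 0);
        if (got >= 0)
            reply[got] = '\0';
    }
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return got;
}
```

Unlike a pipe, each end of the pair can both send and receive. A real
server would use `socket', `bind', `listen' and `accept' on a named
UNIX socket instead of a pre-connected pair.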
2.4.1.4. Message Queues
You can think of this type of communication as a one-to-many
relationship, which is what you want when messages have to reach many
processes through a single queue. The only difference between it and
mailboxes is that it has a restriction on the size of each message;
it shares the same synchronization model as mailboxes, which is
asynchronous, meaning that the sender and the receiver do not need to
interact with the message queue at the same time. It's also similar in
some ways to Pipes, except that data travels as whole messages rather
than a byte stream, and it all happens in memory, no files.
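A minimal sketch of the idea, assuming the System V flavour of message
queues (`msgget', `msgsnd', `msgrcv'); POSIX queues (`mq_open' and
friends) would work just as well. The struct text_msg, the
MQ_TEXT_MAX limit and the function name mq_roundtrip are all made up
for illustration.

```c
#define _XOPEN_SOURCE 700
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/types.h>

#define MQ_TEXT_MAX 64

/* A System V message is a long "type" field followed by the payload. */
struct text_msg {
    long mtype;                 /* must be greater than zero */
    char mtext[MQ_TEXT_MAX];
};

/* Create a private queue, send text as one message, receive it again,
 * then remove the queue. Returns 0 on success, -1 on error. */
int mq_roundtrip(const char *text, char *out, size_t cap)
{
    struct text_msg snd = { .mtype = 1 }, rcv;
    size_t len = strlen(text) + 1;          /* include the NUL */
    if (len > MQ_TEXT_MAX || len > cap)
        return -1;

    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid == -1)
        return -1;

    int rc = -1;
    memcpy(snd.mtext, text, len);
    /* The sender returns as soon as the message is queued; the
     * receiver picks it up whenever it gets around to it. */
    if (msgsnd(qid, &snd, len, 0) == 0 &&
        msgrcv(qid, &rcv, MQ_TEXT_MAX, 1, 0) == (ssize_t)len) {
        memcpy(out, rcv.mtext, len);
        rc = 0;
    }
    msgctl(qid, IPC_RMID, NULL);            /* queues outlive processes */
    return rc;
}
```

In a real program the sender and the receiver would be different
processes that agree on the queue id (for example through `ftok');
doing both in one function just keeps the sketch short.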
2.4.1.5. Semaphores
These are more locks than a way to communicate, and are mostly used
around some shared resource that resides either in memory or on disk.
Simply put, they are a location in memory whose value can be tested and
set by more than one process; the test/set operations are atomic, or
uninterruptible, which from a process point of view means that once
started nothing can stop them.
You can think of them as a counter that gets decremented when a
process or thread enters the critical region to modify some
critical resource (ex. memory page, file, etc.), and incremented
again, in an atomic fashion, once it is finished with the region in
question. Whoever finds the counter at zero has to wait.
Semaphores are not the best solution out there. It's quite expensive
to lock and unlock a semaphore; it can take thousands of CPU
cycles to do so, because of the system calls that have to be made. But
they have their uses. For example, if your critical region is supposed
to be just setting an integer value to a variable, then it is a very
bad idea to use semaphores. But if you have an operation like writing
to some file, then it's worth it to put some process to sleep and then
wake it up after you finish, which is what semaphores do.
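The counter behaviour can be sketched with a System V semaphore,
assuming Linux (where the program has to define `union semun' itself).
All the demo_sem_* names are hypothetical helpers, not part of any
library.

```c
#define _XOPEN_SOURCE 700
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

/* On Linux, semctl(2) requires the caller to define this union. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Create a private set holding one semaphore with initial value 1.
 * Returns the semaphore id, or -1 on error. */
int demo_sem_create(void)
{
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id == -1)
        return -1;
    union semun arg = { .val = 1 };
    if (semctl(id, 0, SETVAL, arg) == -1) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }
    return id;
}

/* Atomically add delta to the semaphore; a negative delta sleeps
 * in the kernel while the value would drop below zero. */
static int demo_sem_change(int id, short delta)
{
    struct sembuf op = { .sem_num = 0, .sem_op = delta, .sem_flg = 0 };
    return semop(id, &op, 1);
}

int demo_sem_lock(int id)    { return demo_sem_change(id, -1); }
int demo_sem_unlock(int id)  { return demo_sem_change(id, +1); }
int demo_sem_value(int id)   { return semctl(id, 0, GETVAL); }
int demo_sem_destroy(int id) { return semctl(id, 0, IPC_RMID); }
```

A process wraps its critical region in demo_sem_lock() and
demo_sem_unlock(); a second process calling demo_sem_lock() in the
meantime is put to sleep until the first one unlocks.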
2.4.1.6. SpinLocks
SpinLocks on the other hand are the fastest out there, simply because
they are built directly on atomic instructions provided by the
hardware and never call into the scheduler. Note that if you are
working on a single core/processor system, SpinLocks are useless
unless you have a preemptive kernel, i.e. preemption is compiled into
your kernel. However, take extra care about when exactly you use
SpinLocks. As their name says, they do a busy spin, which means that
they keep spinning in a while loop, saving the time of putting the
process to sleep and waking it up again; so if the critical section
takes more than a thread quantum, then SpinLocks are a very bad idea.
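To show what "spinning on an atomic instruction" looks like, here is a
user space sketch built on the C11 `atomic_flag' type. It is not the
kernel's spinlock implementation, just the same idea in miniature, and
the demo_spin* names are made up.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag held;
} demo_spinlock;

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once. atomic test-and-set returns the previous value of the
 * flag, so "false" means nobody held the lock and now we do. */
bool demo_spin_trylock(demo_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held,
                                              memory_order_acquire);
}

/* Busy wait until the lock is ours: no sleeping, no system call. */
void demo_spin_lock(demo_spinlock *l)
{
    while (!demo_spin_trylock(l))
        ;   /* spin */
}

void demo_spin_unlock(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The empty loop body is the whole point and the whole danger: a waiter
burns CPU for as long as the holder stays in the critical section.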
2.4.1.7. Mutexes
You can think of Mutexes as hybrids of SpinLocks and Semaphores, which
explains why they are the most used in user space applications. On
Linux they are built on futexes: as long as nobody else holds the
lock, locking and unlocking are done with an atomic instruction
without the kernel's help, and the expensive system call is only made
when a thread has to be put to sleep or woken up. Basically if you
don't know what you are doing, your best bet is to use Mutexes; they
are widely known and understood, so you won't be bothered with all the
technicalities. But yet again, if you don't want to be bothered, this
course is totally not for you sir/ma'am, go have some windows lecture
instead.
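A minimal pthread sketch of a mutex protecting a shared counter. The
names worker and run_two_workers and the INCREMENTS constant are
assumptions for the example; build with `-pthread' on toolchains that
need it.

```c
#include <pthread.h>

#define INCREMENTS 100000

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker bumps the shared counter INCREMENTS times. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;                          /* the critical region */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Run two workers side by side. Returns the final counter value,
 * or -1 if a thread could not be created. */
long run_two_workers(void)
{
    pthread_t a, b;
    counter = 0;
    if (pthread_create(&a, NULL, worker, NULL) != 0)
        return -1;
    if (pthread_create(&b, NULL, worker, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;
}
```

With the lock in place the result is always twice INCREMENTS. Remove
the lock/unlock pair and, on an SMP machine, the total will usually
come out short because increments from the two threads overwrite each
other.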
2.4.1.8. Shared Memory
Shared pages of memory are what they are. I can't really think of a
better way to describe them, except that they have an id just like
MQueues (Message Queues, duh sherlock) but simply share memory
pages. The pages do not necessarily have the same address in each
process accessing them, but they do reference the same physical page.
The mechanics of this part are complex and deep, so I leave them for
later on in the course.
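Until then, a small sketch, assuming System V shared memory
(`shmget', `shmat'): a child writes an integer into a shared segment
and the parent reads it after the child has exited. The function name
shm_pass_int is hypothetical, and waiting for the child to exit is a
deliberately crude stand-in for real locking.

```c
#define _XOPEN_SOURCE 700
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A forked child stores value in a shared segment; the parent reads it
 * once the child has exited. Returns the value read, or -1 on error
 * (so -1 itself is not a useful value to pass in this sketch). */
int shm_pass_int(int value)
{
    int id = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    if (id == -1)
        return -1;

    int *shared = shmat(id, NULL, 0);       /* map it into our space */
    if (shared == (void *)-1) {
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }
    *shared = 0;

    int result = -1;
    pid_t pid = fork();                     /* child inherits the mapping */
    if (pid == 0) {
        *shared = value;
        shmdt(shared);
        _exit(0);
    }
    if (pid > 0) {
        waitpid(pid, NULL, 0);              /* crude synchronization */
        result = *shared;
    }
    shmdt(shared);
    shmctl(id, IPC_RMID, NULL);             /* segments outlive processes */
    return result;
}
```

Two unrelated processes would instead agree on a key (for example via
`ftok') and each call `shmget' and `shmat' themselves, with a
semaphore or mutex guarding the access.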
2.4.2. Synchronization
We have seen two terms till now, asynchronous and synchronous
operations. The two differ only by one character, but in meaning
they differ a lot. Asynchronous operations are operations that
do not necessarily expect answers right away. An example would
be your email: you send an email to a friend, but do you expect
him to answer instantly? No. Synchronous operations on the other
hand are operations that do expect answers instantly; basically
they block until the answer arrives. An example would be talking with
your friend on the phone: when you talk, he listens and responds in
the same conversation context.
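One way to see the difference in code is a read on an empty pipe: by
default it blocks until data arrives (synchronous), but with
O_NONBLOCK set it returns immediately and tells you to come back
later. The sketch below shows only the non-blocking half; the function
name empty_pipe_would_block is made up.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if a non-blocking read on an empty pipe fails right away
 * with EAGAIN, 0 if it does not, and -1 if the setup itself failed. */
int empty_pipe_would_block(void)
{
    int fds[2];
    char byte;
    if (pipe(fds) == -1)
        return -1;

    int flags = fcntl(fds[0], F_GETFL);
    if (flags == -1 ||
        fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* A blocking read would sleep here until somebody writes. The
     * non-blocking one returns -1 at once and sets errno instead. */
    ssize_t n = read(fds[0], &byte, 1);
    int would_block =
        (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));

    close(fds[0]);
    close(fds[1]);
    return would_block;
}
```

The asynchronous caller is then free to do other work and retry
later, typically after `select' or `poll' reports the descriptor as
readable.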
2.4.3. Common Problems
Most IPC methods of communication are known to share one big
common problem, which is synchronization. In the context
of computers and applications this means managing access to shared
resources between processes/threads, especially on a system
that has more than one CPU (either virtual or physical).
Some of these problems touch security, like race conditions.
Race conditions happen when two threads or processes race for
a certain operation, like setting or reading a value. When that
happens, it can be exploited to corrupt the system memory, or even
to gain unauthorized access to the system itself, so you must take
extra caution when setting up locks.
Another problem is deadlocks. The classic case is two threads each
holding a lock the other one needs; another is a process or thread
that locks and dies before it unlocks. Either way all other
processes/threads are kept waiting or spinning on the lock forever,
which ultimately hangs the application until somebody kills it.
These kinds of problems are very hard to debug, so I felt like
mentioning them in a separate section to show how important they are.
Otherwise you will end up producing very bad code, and some guy with
a tie and a suit who knows next to nothing about computers will be
yelling at you real loud, asking for the name of the person that
taught you this stuff so he can murder him with a chainsaw.
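One common defence against the two-lock deadlock, sketched here as an
illustration rather than as the only cure, is to always take the locks
in the same global order, for example by address. The struct account
and the transfer function are hypothetical.

```c
#include <pthread.h>
#include <stdint.h>

struct account {
    pthread_mutex_t lock;
    long balance;
};

/* Move amount from one account to the other. Both locks are always
 * taken lowest address first, so two opposite transfers running at
 * the same time cannot end up each holding the lock the other needs. */
void transfer(struct account *from, struct account *to, long amount)
{
    if (from == to)
        return;

    struct account *first  = (uintptr_t)from < (uintptr_t)to ? from : to;
    struct account *second = (first == from) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);

    from->balance -= amount;
    to->balance   += amount;

    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}
```

Had each call simply locked `from' and then `to', a transfer from A to
B racing against a transfer from B to A could leave both threads
waiting on each other forever.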
3. System Application Design
3.1. Strategy
If you have grown up to be the strategy nut I am, I'm sure you will be
very good at this stuff, simply because on the fly planning always
works and pre-planned stuff always fails. You need to have a vision
into things, along with adapting and inventing different strategies of
your own, to be able to see bugs and errors at a glance, mitigate
them right away, and know exactly what to modify and what to leave
as-is. It's really a very good skill to have, and like every other
skill it comes with effort and training. The only method I found very
effective to train that kind of skill is to look at others' code: see
and learn how they have done things in their own way and style, try to
understand it from the little tiny pieces, put the pieces together to
form the mental image they had when they first developed that
application, and see if it can be improved in any way.
Bluntly, as I always say, try to be more of a hacker that investigates
every single detail and tries to understand the whole of everything.
It's not a shame to fail thousands of times, but it is a shame to be
ignorant even about the little tiny things that everybody else
discards, and always remember that knowledge is power.
3.2. Daemons
So what are these little evil daemons, huh? They are simply what
windows people call "background processes" or "services".
If you never developed one before, it's time to build one. The main
difference between a daemon and a process like "top &" is that
the former forks a child that detaches from the terminal, closes all
standard descriptors (stdin, stdout, stderr), and stays in a
conditional loop till the condition that keeps the loop running is
gone (ex. until a SIGTERM signal arrives). Also, logging is a major
trait of daemons; most have logs of their own and those that don't
make use of other pre-installed logging systems.
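The detaching steps can be sketched roughly like this. It is a
minimal single-fork version with a made-up function name (daemonize);
instead of merely closing the standard descriptors it points them at
/dev/null, which is a common variation and an assumption of this
sketch, not something the outline prescribes.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Turn the calling process into a daemon. The original process exits;
 * the function returns 0 in the detached child, or -1 on error. */
int daemonize(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid > 0)
        _exit(EXIT_SUCCESS);        /* parent goes back to the shell */

    if (setsid() == -1)             /* new session, no controlling tty */
        return -1;
    if (chdir("/") == -1)           /* don't keep any mount point busy */
        return -1;
    umask(0);

    /* Redirect stdin/stdout/stderr to /dev/null so a stray printf
     * cannot land in some descriptor that got reused later. */
    int devnull = open("/dev/null", O_RDWR);
    if (devnull == -1)
        return -1;
    dup2(devnull, STDIN_FILENO);
    dup2(devnull, STDOUT_FILENO);
    dup2(devnull, STDERR_FILENO);
    if (devnull > STDERR_FILENO)
        close(devnull);
    return 0;
}
```

After daemonize() returns 0 the program would install a SIGTERM
handler that clears a flag and then sit in its main loop for as long
as that flag stays set.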
3.3. Logs
Any daemon should have one or more ways to communicate with the system
administrator. If it's not logs, what would it be? You can't communicate
through standard output because a daemon is never attached to any
ttys or pts's, so it's gotta be logging. There are several mechanisms
for logging. Either you design your own logging, in which case you
will have to rotate your logs so you won't end up with one file of
gigs of bytes; rotating logs isn't hard, you can make use of
`logrotate'. Or you can save yourself all the trouble and make use of
`syslog', which is a logging system that provides a very simple
interface.
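The `syslog' interface really is about three calls. A minimal sketch
follows; the helper names log_open, log_set_threshold and
log_listening are made up, and the LOG_DAEMON facility and the message
text are just assumptions for the example.

```c
#include <syslog.h>

/* Open the connection to the system logger. ident is prepended to
 * every message; LOG_PID adds the process id as well. The string must
 * stay valid for as long as the log is open. */
void log_open(const char *ident)
{
    openlog(ident, LOG_PID | LOG_NDELAY, LOG_DAEMON);
}

/* Drop every message less severe than level (for example LOG_WARNING).
 * Returns the previous priority mask. */
int log_set_threshold(int level)
{
    return setlogmask(LOG_UPTO(level));
}

/* Log a printf style message at "informational" severity. */
void log_listening(int port)
{
    syslog(LOG_INFO, "listening on port %d", port);
}
```

A daemon would call log_open() once at startup, log through `syslog'
wherever it used to print, and call `closelog' on the way out. Where
the messages end up is decided by the system logger's configuration,
not by your code.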
3.4. Storage
Some daemons need some way of organized storage. There are several
solutions to fulfill your storage needs. One is to use `SQLite',
which is used by many applications, but `SQLite' is only for not so
sophisticated schemes of storage, and you should only use it in cases
that do not require huge data sets to be stored. If your application
requires some heavy duty DB system, I strongly recommend the usage of
`MySQL'; it's a very good and well known DBMS that comes along with a
very well done API.
3.5. Debugging
Debugging daemons is especially hard, because it usually involves
threads and you want to know where exactly a certain bug is. However,
you first need to learn how to get past the fork done at startup that
spawns the background process. The command
`set follow-fork-mode child' forces GDB to follow the child of a fork,
meaning that once your daemon forks a background process, GDB starts
debugging the new child, and the parent is detached and left to run
on its own.
To know how many threads are running and which one is currently in
context, you issue `info threads' which will display a numbered
list. But what if you wanted to switch to one of these threads? Easy,
you just issue `thread [threadno]' where [threadno] is the thread
number you got from `info threads'. But I strongly recommend that you
learn GDB from the ground up, as it is an essential tool in
development under UNIX based systems.
4. Case Studies
4.1. Case 1
Develop an application that creates exactly two threads to calculate
the Fibonacci sequence in parallel, starting at any given point in the
sequence.
ex. of input file ...
04 08
89 21
55 89
13 21
The first column is where the Fibonacci sequence begins and the second
column is where it ends. The results should be written to standard
output, with each line in the input file corresponding to one line
of output.
Also note that the columns are not sorted: sometimes the first column
is the beginning of the sequence and other times the second column is
where it begins.
4.2. Case 2
Create another application that calculates the Collatz conjecture
series based on the previous application's output, except that it
communicates with the previous application over shared memory pages.
For each outputted line of the Fibonacci sequence, a Collatz series
has to be generated for each number in that line and written to
standard output in the form of a table in which each column begins
with the original number from the first application and ends with
one (1) (read about the Collatz conjecture on Wikipedia).
4.3. Case 3
Create a daemon that forks to the background and closes all standard
descriptors, creates a few threads to pre-calculate a very large
number of Fibonacci and Collatz series, and a client application that
communicates with the daemon over shared memory pages to get some of
these results and display them on standard output.
Note that the number of results to be requested from the daemon
must come from the user, not be hard coded, so your application must
ask for the number of results before getting any.
4.4. Final Project
With a team, develop a server that listens on a specific IP/port
and can handle simultaneous connections using forking or threading.
If you choose forking, you will have to implement IPC between the
forked processes and the parent process, as the data in each process
has to be shared across all other processes in a central fashion:
the parent holds all the data in a shared memory page, and all the
forked processes access it from there. If you however decide to go
with threading, you have the advantage that you won't have to
implement IPC in your system, but you will have to design the threads
as a worker thread and a thread pool. The worker thread waits on
connections, and once a request for a connection is presented, a
thread from the thread pool is assigned to this connection. Once the
connection ends, the thread is released from the connection and goes
back to the thread pool as an available thread again.
This server's purpose is to broadcast messages to each and every client
connected to it, but also to store the last hour of messages in a queue
for offline clients, so when they log in to the server they receive
all the messages exchanged between clients during the last hour. This
also means that you will have to develop a client that connects to the
server and is able to send messages to and receive messages from the
server.
If you want to get fancy, develop a configuration file parser, and put
the listening IP/port into a file to be read by the server when it
starts. This is not required, it's just my way of saying that if you
want to get creative, it's absolutely encouraged.
Good luck :-)
5. Author
5.1. Background
I started coding at the early age of 10, and once
I wrote my first few lines on an MSX-170/MSX-350, I never
looked back. Programming and being able to have full control over
a machine has been an addiction of mine for many years gone and many
to come. I started to dwell in the security field by writing my
first symmetric encryption algorithm at the age of 14, which got me
even more interested in programming but at a totally different level.
All I have wanted ever since is to be able to code at the most intimate
level of the machine, and so I have done: I'm now able to code
some of the BIOS, and I have learned Verilog and am able to design and
write FPGA solutions.
As for security, let's just say it became second nature to me and
a passion. I see vulnerabilities in humans, let alone code; I can
manipulate about everything from a group of processes to a group
of people. It all comes down to this: once you discover this
security 7th sense it just becomes like your sense of vision, it
changes each and every aspect of your life.
5.2. Contact Information
Please visit http://amr-ali.co.cc
6. Thanks
I'd like to thank my mother for all the support, encouragement, and love
she always gave me in that direction. (Love ya mommy :-P)
I would also like to give thanks to the people that effectively
changed my life for the better and were patient all along ...
Gerald M O'Steen - For being the awesome mentor he is and for teaching
me everything he could and being a very very good friend.
Mark LaDoux - For beating me like a dead cow till I matured and learned
the ways of pursuing knowledge and being a good friend.
Love you all guys <31337